1 Introduction

Trendyol sells a large number of products every day. Estimating how many units a product will sell helps companies manage inventory, pricing, and listing strategies. In this project, we forecast the daily sales of nine Trendyol products from June 11, 2021 until June 25, 2021. The products come from various categories and are examined individually.

To make our predictions, we had to provide the next day's estimate using data available up to the previous day; in effect, every day we predicted sales two days ahead of the latest observation. New data points were added to the system daily, and we updated our models before each day's submission.

The given data starts on 25.05.2020 and provides the following daily statistics: the average price of the sold product, the number of items sold, the number of visits, the number of times the product was added to favorites and to baskets, the number of items sold in the product's category and by the brand within that category, the number of visits to the category, and the number of visits to the Trendyol website. Some of these variables were not accurate; for example, the website visit counts were unreliable, so this variable is not used in the analysis. The data was provided daily; however, some variables are missing on certain days, which is also taken into consideration in the analysis.

To improve our submissions, we examined the Trendyol website every day to track the positioning, price, campaigns, and competitors of each product. We adjusted the models' predictions in light of factors visible on the website but absent from the dataset, taking into consideration stock-out cases and the prices offered by other sellers.

To make the predictions, we built our models using the Auto-Regressive Integrated Moving Average (ARIMA), Auto-Regressive Integrated Moving Average with Exogenous Input (ARIMAX), and dynamic regression methods. We also used naive estimates for comparison. Our final predictions were obtained by combining the well-performing methods through weighted averages, adjusted according to our observations from the Trendyol website.

Our rankings were evaluated according to the WMAPE (Weighted Mean Absolute Percentage Error). This metric weights the error by sales volume, giving a higher weight to the products that sell more.
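WMAPE can be computed with a short helper function such as the one below (a sketch; the exact implementation used for the official ranking is Trendyol's):

```r
# WMAPE: total absolute error divided by total actual sales, i.e. a MAPE in
# which each observation is weighted by its sales volume.
wmape <- function(actual, forecast) {
  sum(abs(actual - forecast)) / sum(actual)
}

# Example: wmape(c(100, 10), c(90, 20)) = (10 + 10) / 110, about 0.18
wmape(c(100, 10), c(90, 20))
```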

In the testing phase, in order to compare the models, we made predictions with a two-day gap between the training and test data, as in the actual setting, and used the WMAPE for comparison.

3 Approach

Series of different products have different needs. Since we predict day t using data available only up to day t-2 (with price available up to t-1), we avoided predictors such as favored_count or category_sold: using them would require forecasting them as well, or simply plugging in their latest or lagged values, which is not favorable. Instead, we used price as a predictor, because it is an attribute we can check on the website whenever we want; price therefore lets us follow the data one day ahead of the other variables.

Our main approach was to plot sold_count over time, check the correlation between price and sales, inspect the ACF and PACF plots, build suitable models, and take the averages of the models' forecasts, revising them when necessary.
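The final averaging step can be sketched as follows (the weights shown are illustrative, not the ones used for any particular product):

```r
# Weighted average of two models' forecasts; the weight w was revised by hand
# using our daily observations from the Trendyol website.
combine_forecasts <- function(f1, f2, w = 0.5) {
  w * f1 + (1 - w) * f2
}

combine_forecasts(420, 380, w = 0.6)  # 404
```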

3.1 Xiaomi

load("C:/Users/alpsr/Desktop/Project/Project Final/xiaomi_environment.RData")

The first product to be forecasted is Xiaomi's Bluetooth headphones. It is one of the highest-selling products considered in this project. Its price changes frequently, and discounts have a significant effect on its sales.

3.1.1 Descriptive Analysis

# ggplot(xiaomi) +
#   geom_boxplot(aes(y = sold_count))
ggplot(xiaomi, aes(x = event_date)) +
  geom_line(aes(y = sold_count))

The time series of the product's sales fluctuates significantly, which is consistent with the sales volume being highly dependent on the price of the product. The variance of the series appears to be constant.

ggplot(xiaomi) +
  geom_histogram(aes(x = sold_count))

The histogram of the sales data shows that the distribution of sales is roughly normal but right-skewed (it has a long right tail), meaning there are some days with very high sales volume relative to the mean.

acf(xiaomi$sold_count, lag.max = 40) # autoregressive signature: sinusoidal / exponentially decreasing --> maybe seasonality

pacf(xiaomi$sold_count, lag.max = 40)

The sales data has very high autocorrelations up to lag 9. Beyond that point, the autocorrelation rises again around lag 30, which suggests a monthly seasonality. In the partial autocorrelation plot, we see a spike roughly every eight to nine days.

3.1.2 Alternative Models

3.1.2.1 Dynamic Regression

ggpairs(xiaomi[,c(4,17:21)]) # visit count lag, price, basket lag, cat sold lag, favored_count lag, cat favored lag

# summary(m11)
# AIC(m11)

In the dynamic regression approach, all variables except price are used with lag 2, as we only have access to data from two days before; price data from the previous day is available. As can be seen from the plot above, all variables are correlated with the sales data to some degree and can therefore be used in the linear regression. The first linear regression model includes trend, month, and day-of-week variables; its results are as follows.
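A minimal sketch of how these lagged regressors can be built with dplyr; the raw column names (`price`, `basket_count`, `category_sold`, `category_visits`, `category_favored`) are our assumptions about the underlying data frame:

```r
library(dplyr)

# Day t's row receives the value observed on day t-2 for each exogenous
# series, except price, which is already observable one day earlier (lag 1).
xiaomi <- xiaomi %>%
  mutate(
    price_lag1            = lag(price, 1),
    basket_count_lag2     = lag(basket_count, 2),
    category_sold_lag2    = lag(category_sold, 2),
    category_visits_lag2  = lag(category_visits, 2),
    category_favored_lag2 = lag(category_favored, 2)
  )
```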

summary(m12)
## 
## Call:
## lm(formula = sold_count ~ trend + month + wday, data = xiaomi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -325.26 -107.26  -20.65   77.75  567.18 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  495.6623    35.1939  14.084  < 2e-16 ***
## trend         -0.6502     0.1858  -3.500 0.000527 ***
## month.L     -208.5217    41.4897  -5.026 8.11e-07 ***
## month.Q      -27.9565    38.2786  -0.730 0.465681    
## month.C      178.6987    40.7820   4.382 1.57e-05 ***
## month^4      137.5321    36.7290   3.745 0.000212 ***
## month^5      -48.5198    30.5394  -1.589 0.113041    
## month^6      -16.2385    35.1415  -0.462 0.644312    
## month^7      -23.0712    28.5759  -0.807 0.420019    
## month^8     -125.1312    32.0106  -3.909 0.000112 ***
## month^9      135.9287    29.6439   4.585 6.37e-06 ***
## month^10      38.4296    29.6143   1.298 0.195277    
## month^11    -100.0228    30.8838  -3.239 0.001319 ** 
## wday.L        -5.8513    21.9066  -0.267 0.789550    
## wday.Q       -64.0097    21.9596  -2.915 0.003793 ** 
## wday.C         8.4399    21.8682   0.386 0.699778    
## wday^4       -11.2072    21.8274  -0.513 0.607973    
## wday^5         2.1307    21.8138   0.098 0.922247    
## wday^6         7.6173    21.8854   0.348 0.728015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156.7 on 341 degrees of freedom
## Multiple R-squared:  0.3085, Adjusted R-squared:  0.272 
## F-statistic: 8.452 on 18 and 341 DF,  p-value: < 2.2e-16
AIC(m12)
## [1] 4681.203

In the second model, external regressors are added to the linear regression: basket count, category sold, category visits, and category favored, all with lag 2, together with price with lag 1. As can be seen below, the model improved in both AIC and adjusted R-squared. Note that several other linear regression models were also considered but are not included in this report.

summary(m14c)
## 
## Call:
## lm(formula = sold_count ~ trend + month + price_lag1 + wday + 
##     basket_count_lag2 + category_sold_lag2 + category_visits_lag2 + 
##     category_favored_lag2, data = xiaomi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -373.51  -79.17  -22.54   57.96  545.07 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.290e+03  1.548e+02   8.334 2.06e-15 ***
## trend                 -1.099e+00  2.053e-01  -5.353 1.61e-07 ***
## month.L               -1.202e+02  4.272e+01  -2.813 0.005206 ** 
## month.Q                6.255e+01  4.098e+01   1.526 0.127914    
## month.C                1.031e+02  3.910e+01   2.636 0.008784 ** 
## month^4                9.411e+01  3.520e+01   2.674 0.007874 ** 
## month^5               -5.532e+01  3.001e+01  -1.843 0.066201 .  
## month^6                2.289e+01  3.217e+01   0.712 0.477202    
## month^7                6.888e+01  2.729e+01   2.524 0.012069 *  
## month^8               -1.119e+02  2.957e+01  -3.783 0.000184 ***
## month^9                1.125e+02  2.716e+01   4.142 4.37e-05 ***
## month^10               6.082e+01  2.711e+01   2.243 0.025521 *  
## month^11              -4.092e+01  2.871e+01  -1.425 0.154965    
## price_lag1            -6.383e+00  1.004e+00  -6.359 6.68e-10 ***
## wday.L                -1.713e+01  1.942e+01  -0.882 0.378432    
## wday.Q                -8.920e+01  1.899e+01  -4.696 3.88e-06 ***
## wday.C                 2.073e+01  1.902e+01   1.090 0.276405    
## wday^4                -1.737e+01  1.882e+01  -0.923 0.356519    
## wday^5                -7.285e+00  1.876e+01  -0.388 0.698079    
## wday^6                 6.256e+00  1.894e+01   0.330 0.741370    
## basket_count_lag2      2.454e-02  2.059e-02   1.192 0.234270    
## category_sold_lag2     3.466e-02  6.018e-02   0.576 0.565089    
## category_visits_lag2  -5.211e-03  1.060e-02  -0.491 0.623448    
## category_favored_lag2  5.593e-03  2.434e-03   2.298 0.022170 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133.9 on 334 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4996, Adjusted R-squared:  0.4651 
## F-statistic:  14.5 on 23 and 334 DF,  p-value: < 2.2e-16
AIC(m14c)
## [1] 4547.22

After constructing the linear regression model above, an ARIMA model is used to model its residuals. First, the residuals are checked for stationarity; the KPSS unit root test shows that they are stationary. The ACF and PACF plots of the residuals are given below: the ACF is sinusoidal, and the PACF spikes at lag 1 and then decreases considerably, so the residuals display an autoregressive behavior. Many parameter combinations were considered for the ARIMA model, and two alternatives were chosen.

# ARIMA on residuals
summary(ur.kpss(random1, use.lag = 30))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 30 lags. 
## 
## Value of test-statistic is: 0.0483 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
tsdisplay(random1)

Finally, a third model is constructed which does not include category visits and category favored as regressors, since their correlations with the sales data are relatively low.

summary(m14b)
## 
## Call:
## lm(formula = sold_count ~ trend + month + price_lag1 + wday + 
##     basket_count_lag2 + category_sold_lag2, data = xiaomi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -391.20  -78.27  -19.28   46.35  577.08 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1214.32342  154.29987   7.870 4.92e-14 ***
## trend                -0.92840    0.19071  -4.868 1.74e-06 ***
## month.L            -101.84549   42.30631  -2.407 0.016608 *  
## month.Q              90.84927   39.74667   2.286 0.022894 *  
## month.C             112.09042   39.52697   2.836 0.004848 ** 
## month^4              68.12767   34.55503   1.972 0.049479 *  
## month^5             -72.50025   28.52656  -2.541 0.011487 *  
## month^6              12.33436   32.28929   0.382 0.702706    
## month^7              61.95489   27.56070   2.248 0.025228 *  
## month^8            -108.64850   29.87052  -3.637 0.000319 ***
## month^9             101.71077   27.24304   3.733 0.000222 ***
## month^10             58.26960   26.75964   2.178 0.030137 *  
## month^11            -36.11005   29.05164  -1.243 0.214749    
## price_lag1           -5.77564    0.98850  -5.843 1.21e-08 ***
## wday.L              -16.46351   19.65168  -0.838 0.402757    
## wday.Q              -88.08872   19.21402  -4.585 6.42e-06 ***
## wday.C               17.77913   19.24616   0.924 0.356266    
## wday^4              -17.86496   19.02855  -0.939 0.348482    
## wday^5               -6.16186   19.01078  -0.324 0.746045    
## wday^6               10.02001   19.02221   0.527 0.598712    
## basket_count_lag2     0.04057    0.01705   2.380 0.017883 *  
## category_sold_lag2    0.07886    0.03601   2.190 0.029219 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135.6 on 336 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.483,  Adjusted R-squared:  0.4507 
## F-statistic: 14.95 on 21 and 336 DF,  p-value: < 2.2e-16
AIC(m14b)
## [1] 4554.848

Also, as can be seen below, the residuals of this linear model are stationary.

# ARIMA on residuals
summary(ur.kpss(random2, use.lag = 30))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 30 lags. 
## 
## Value of test-statistic is: 0.0529 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
tsdisplay(random2)

3.1.2.1.1 Alternative 1

Results of the first alternative can be seen below. It includes one autoregressive and one moving average term.

summary(d18)
## 
## Call:
## arima(x = random1, order = c(1, 0, 1), include.mean = FALSE)
## 
## Coefficients:
##          ar1     ma1
##       0.2301  0.2213
## s.e.  0.1206  0.1212
## 
## sigma^2 estimated as 13746:  log likelihood = -2213.68,  aic = 4433.36
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.07673911 117.2417 84.04904 -157.1807 389.6492 0.8533916
##                      ACF1
## Training set 0.0009375239
checkresiduals(d18)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1) with zero mean
## Q* = 81.239, df = 58, p-value = 0.02371
## 
## Model df: 2.   Total lags used: 60

As can be seen above, the residuals of the ARIMA model largely satisfy the model assumptions: the error terms are approximately normally distributed with zero mean and constant variance, and their autocorrelations are mostly insignificant, although the Ljung-Box test is marginally significant (p ≈ 0.024).

tsdisplay(ts(xiaomi$sold_count-xiaomi$Model1Fitted))

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model1Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. The analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not significant, and the errors are distributed with zero mean and constant variance, satisfying the assumptions.
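A sketch of how such a combination can be computed with the forecast package, using this section's object names; the exact code, and the hypothetical one-row data frame `new_row` holding the target day's regressors, are our assumptions:

```r
library(forecast)

# Fitted values of the combined model: the regression fit plus the fit of
# the ARMA model built on the regression residuals.
combined_fitted <- fitted(m14c) + fitted(d18)

# Two-step-ahead prediction: the regression prediction for the target day
# (its lagged regressors are already observed) plus the ARMA forecast of
# the residual series.
prediction <- predict(m14c, newdata = new_row) + forecast(d18, h = 2)$mean[2]
```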

3.1.2.1.2 Alternative 2

The second alternative uses the version of the linear regression that excludes category visits and category favored as regressors. The ARIMA model for its residuals includes three autoregressive and three moving average terms; its results can be seen below.

summary(e115)
## 
## Call:
## arima(x = random2, order = c(3, 0, 3), include.mean = FALSE)
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2     ma3
##       1.8201  -0.8661  0.0165  -1.3969  0.0631  0.3392
## s.e.  0.2026   0.3261  0.1429   0.1947  0.2571  0.1031
## 
## sigma^2 estimated as 12979:  log likelihood = -2204.15,  aic = 4422.3
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.6584335 113.9255 81.32291 39.41063 196.4825 0.8192124
##                     ACF1
## Training set 0.002889852
checkresiduals(e115)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,3) with zero mean
## Q* = 64.648, df = 54, p-value = 0.1521
## 
## Model df: 6.   Total lags used: 60

As can be seen above, the residuals of the ARIMA model seem to satisfy the model assumptions: the error terms are approximately normally distributed with zero mean and constant variance, and their autocorrelations are mostly insignificant. The autocorrelations at lags 27 and 35 are significant, but they remain relatively low.

tsdisplay(ts(xiaomi$sold_count-xiaomi$Model2Fitted))

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model2Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. The analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not significant, and the errors are distributed with zero mean and constant variance, satisfying the assumptions.

3.1.2.2 ARIMA Model

3.1.2.2.1 Alternative 3

In the second approach, an ARIMA model is constructed directly from the sales data using the auto.arima function. As the series is not stationary, the first difference is taken. The resulting model includes three autoregressive terms and one moving average term. The summary output of the model can be seen below.
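The model above is presumably produced by a call of the following form (the construction of `xiaomi_ts` is our assumption):

```r
library(forecast)

# auto.arima selects the differencing degree via unit-root tests and the
# ARMA orders via AICc; here it settles on ARIMA(3,1,1).
xiaomi_ts <- ts(xiaomi$sold_count)
xiaomi_sarima <- auto.arima(xiaomi_ts)
```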

summary(xiaomi_sarima)
## Series: xiaomi_ts 
## ARIMA(3,1,1) 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1
##       0.7439  -0.0465  0.0091  -0.9707
## s.e.  0.0557   0.0658  0.0548   0.0185
## 
## sigma^2 estimated as 14636:  log likelihood=-2229.64
## AIC=4469.29   AICc=4469.46   BIC=4488.7
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -4.164902 120.1373 83.56976 -11.65775 26.02221 0.9744875
##                      ACF1
## Training set -0.002471959
checkresiduals(xiaomi_sarima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)
## Q* = 10.546, df = 6, p-value = 0.1035
## 
## Model df: 4.   Total lags used: 10

As can be seen above, the residuals of the ARIMA model seem to satisfy the model assumptions: the error terms are approximately normally distributed with zero mean and constant variance, and their autocorrelations are mostly insignificant, apart from a slightly higher autocorrelation at lag 7. The plot below compares the fitted and actual sales values.

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model3Fitted, color = "fitted"))

3.1.2.3 Decomposition

3.1.2.3.1 Alternative 4

In the final approach, the sales data is decomposed using classical decomposition, and an ARIMA model is then fitted to the random component. The frequency of the series is chosen as 28, since the autocorrelation has its highest value at lag 28. The prediction for the seasonal component repeats its last available cycle, and the trend component is forecast with Holt's method.
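This scheme can be sketched as follows (a sketch with our assumed object names; `r113` denotes the ARIMA model fitted to the random component later in this section):

```r
library(forecast)

# Classical decomposition with a 28-day period.
xiaomi_ts_dec  <- decompose(ts(xiaomi$sold_count, frequency = 28))
xiaomi_dec_ran <- na.omit(xiaomi_ts_dec$random)

# Component forecasts: repeat the last available seasonal cycle, extrapolate
# the trend with Holt's method, and forecast the random part with ARIMA.
seasonal_fc <- tail(xiaomi_ts_dec$seasonal, 28)
trend_fc    <- forecast(holt(na.omit(xiaomi_ts_dec$trend)), h = 2)$mean
random_fc   <- forecast(r113, h = 2)$mean

# The final prediction adds the three components back together.
```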

plot(xiaomi_ts_dec)

The decomposition plot shows that the random component resembles stationary data with zero mean and constant variance. Also, the trend component does not show any seasonality, which suggests that most of the seasonal effects are accounted for.

summary(r113)
## 
## Call:
## arima(x = xiaomi_dec_ran, order = c(2, 0, 1), include.mean = FALSE)
## 
## Coefficients:
##          ar1      ar2      ma1
##       1.5807  -0.6572  -1.0000
## s.e.  0.0409   0.0410   0.0095
## 
## sigma^2 estimated as 10852:  log likelihood = -2015.68,  aic = 4039.37
## 
## Training set error measures:
##                     ME     RMSE      MAE     MPE     MAPE      MASE
## Training set -5.484992 104.1709 74.39929 54.1406 137.9556 0.8747752
##                      ACF1
## Training set -0.006050699
checkresiduals(r113)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,1) with zero mean
## Q* = 87.144, df = 53, p-value = 0.002167
## 
## Model df: 3.   Total lags used: 56

The ARIMA model fitted to the random component has two autoregressive terms and one moving average term. Before selecting it, several parameter combinations were considered through a systematic search, and the best model was chosen according to the AIC values. The residuals of the ARIMA model seem to satisfy the model assumptions, with only a slightly high autocorrelation at lag 35, although the Ljung-Box test is significant (p ≈ 0.002).
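The systematic search can be sketched as a loop over candidate orders, keeping the lowest-AIC fit (the search bounds 0 to 3 are our assumption):

```r
# Grid search over ARMA(p, q) orders on the decomposition's random component.
best_fit <- NULL
for (p in 0:3) {
  for (q in 0:3) {
    fit <- tryCatch(
      arima(xiaomi_dec_ran, order = c(p, 0, q), include.mean = FALSE),
      error = function(e) NULL  # some orders may fail to converge
    )
    if (!is.null(fit) && (is.null(best_fit) || AIC(fit) < AIC(best_fit))) {
      best_fit <- fit
    }
  }
}
```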

tsdisplay(ts(xiaomi$sold_count-xiaomi$Model4Fitted))

ggplot(xiaomi, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model4Fitted, color = "fitted"))

Finally, the trend, seasonal, and forecasted random components are combined to generate the fitted series. The comparison of the fitted and actual data can be seen above.

3.1.3 Model Evaluation

eval
##           Model  n     mean       sd        CV      FBias      MAPE      RMSE
## 1 Alternative 1 15 438.6667 100.6909 0.2295386 0.10401620 0.1963970 106.32511
## 2 Alternative 2 15 438.6667 100.6909 0.2295386 0.10674917 0.1904766 104.55425
## 3 Alternative 3 15 438.6667 100.6909 0.2295386 0.06725751 0.1527301  85.22922
## 4 Alternative 4 15 438.6667 100.6909 0.2295386 0.07843792 0.2194289 124.73987
## 5         Naive 15 438.6667 100.6909 0.2295386 0.06534954 0.1912518 100.10728
##        MAD      MADP     WMAPE
## 1 87.90291 0.2003866 0.2003866
## 2 86.12033 0.1963229 0.1963229
## 3 69.67817 0.1588408 0.1588408
## 4 93.62664 0.2134346 0.2134346
## 5 85.86667 0.1957447 0.1957447

Finally, the models are tested between 28 May 2021 and 11 June 2021, with predictions made for the day after tomorrow, as in the actual setting. The results of the tests are summarized in the table above along with the naive forecasts. Alternative 3 is the only model that outperforms the naive forecast, while Alternative 2 comes very close to it.

To generate the forecasts for this project, the predictions of Alternative 2 and Alternative 3 are considered.

3.2 Fakir

The second product to be forecasted is the Bluetooth headphones of Fakir. It is one of the lowest-selling products considered in this project and has a relatively stable price.

rm(list = ls())
load("C:/Users/alpsr/Desktop/Project/Project Final/Fakir_environment.RData")

3.2.1 Descriptive Analysis

The time series of the product's sales fluctuates significantly. The series has both trend and seasonal components. The variance of the series appears to be constant, although there are occasional extreme observations.

# ggplot(fakir) +
#   geom_boxplot(aes(y = sold_count))
ggplot(fakir, aes(x = event_date)) +
  geom_line(aes(y = sold_count))

The histogram of the sales data shows that the distribution is close to normal, with some deviations such as a long right tail.

ggplot(fakir) +
  geom_histogram(aes(x = sold_count))

The autocorrelation and partial autocorrelation plots of the sales data are shown below. The slowly decreasing autocorrelation suggests a trend in the data. Also, the partial autocorrelations become insignificant after lag 2, which suggests an autoregressive behavior.

acf(fakir$sold_count, lag.max = 40) 

pacf(fakir$sold_count, lag.max = 40)

3.2.2 Alternative Models

3.2.2.1 Dynamic Regression

ggpairs(fakir[,c(4,17:21)]) # visit count lag, price, basket lag, cat sold lag, favored_count lag, cat favored lag

In the dynamic regression approach, all variables except price are used with lag 2, as we only have access to data from two days before; price data from the previous day is available. As can be seen from the plot above, all variables are correlated with the sales data to some degree and can be used in the linear regression. The first linear regression model includes trend and month variables; its results are as follows.

summary(m11)
## 
## Call:
## lm(formula = sold_count ~ trend + month, data = fakir)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.361  -7.694  -1.326   4.880  45.411 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.54458    2.72962  14.854  < 2e-16 ***
## trend        -0.04775    0.01442  -3.310 0.001036 ** 
## month.L      17.36882    3.24443   5.353 1.63e-07 ***
## month.Q       3.74187    3.00300   1.246 0.213638    
## month.C       6.43122    3.13947   2.049 0.041306 *  
## month^4     -13.53968    2.90895  -4.654 4.72e-06 ***
## month^5     -15.88226    2.44505  -6.496 3.07e-10 ***
## month^6       9.76202    2.81245   3.471 0.000588 ***
## month^7      10.72825    2.30526   4.654 4.74e-06 ***
## month^8       0.80342    2.52604   0.318 0.750646    
## month^9      -3.04109    2.33584  -1.302 0.193854    
## month^10     -1.33884    2.31077  -0.579 0.562722    
## month^11      1.44541    2.43072   0.595 0.552492    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.05 on 328 degrees of freedom
## Multiple R-squared:  0.3967, Adjusted R-squared:  0.3746 
## F-statistic: 17.97 on 12 and 328 DF,  p-value: < 2.2e-16
AIC(m11)
## [1] 2679.805

Before using the residuals of this model in the ARIMA models, we need to check whether they are stationary. The result of the unit root test shows that the residuals are stationary.

summary(ur.kpss(random1, use.lag = 32))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 32 lags. 
## 
## Value of test-statistic is: 0.0574 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
tsdisplay(random1)

In the second alternative, day-of-week variables are added to the model. The model does not improve: the AIC increases and the adjusted R-squared decreases.

summary(m12)
## 
## Call:
## lm(formula = sold_count ~ trend + month + wday, data = fakir)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.475  -7.230  -1.553   4.241  44.690 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.60779    2.73168  14.866  < 2e-16 ***
## trend        -0.04821    0.01444  -3.340 0.000937 ***
## month.L      17.26346    3.24760   5.316 1.99e-07 ***
## month.Q       3.83389    3.00595   1.275 0.203075    
## month.C       6.47147    3.14485   2.058 0.040414 *  
## month^4     -13.62563    2.91144  -4.680 4.23e-06 ***
## month^5     -15.81095    2.44976  -6.454 4.00e-10 ***
## month^6       9.97599    2.81777   3.540 0.000458 ***
## month^7      10.79234    2.30755   4.677 4.29e-06 ***
## month^8       0.88361    2.52813   0.350 0.726936    
## month^9      -2.75505    2.34425  -1.175 0.240769    
## month^10     -1.36543    2.31323  -0.590 0.555423    
## month^11      1.26775    2.43425   0.521 0.602865    
## wday.L       -3.72659    1.70644  -2.184 0.029695 *  
## wday.Q        1.16399    1.73338   0.672 0.502373    
## wday.C        0.12785    1.72361   0.074 0.940915    
## wday^4        0.67859    1.71815   0.395 0.693141    
## wday^5       -0.26378    1.74024  -0.152 0.879614    
## wday^6        0.66253    1.76699   0.375 0.707943    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.05 on 322 degrees of freedom
## Multiple R-squared:  0.4069, Adjusted R-squared:  0.3738 
## F-statistic: 12.27 on 18 and 322 DF,  p-value: < 2.2e-16
AIC(m12)
## [1] 2685.958

Finally, a linear regression model with external regressors alongside the time series variables is considered. This model has better AIC and adjusted R-squared values.

summary(m14c)
## 
## Call:
## lm(formula = sold_count ~ trend + month + price_lag1 + wday + 
##     basket_count_lag2 + category_sold_lag2 + category_visits_lag2 + 
##     category_favored_lag2, data = fakir)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.547  -6.271  -0.834   5.307  32.735 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.719e+02  2.327e+01   7.385 1.37e-12 ***
## trend                  1.173e-02  2.044e-02   0.574 0.566480    
## month.L                5.088e-01  3.020e+00   0.169 0.866289    
## month.Q                1.517e+01  3.267e+00   4.643 5.05e-06 ***
## month.C                1.061e+01  2.835e+00   3.744 0.000215 ***
## month^4               -1.073e+01  2.480e+00  -4.325 2.05e-05 ***
## month^5               -9.695e+00  2.079e+00  -4.663 4.61e-06 ***
## month^6                1.163e+01  2.320e+00   5.015 8.89e-07 ***
## month^7                1.100e+01  1.783e+00   6.166 2.14e-09 ***
## month^8                2.303e+00  2.062e+00   1.117 0.264838    
## month^9               -1.502e+00  1.841e+00  -0.816 0.415100    
## month^10               4.283e+00  1.929e+00   2.220 0.027108 *  
## month^11              -2.976e+00  2.054e+00  -1.449 0.148282    
## price_lag1            -5.927e-01  9.699e-02  -6.111 2.92e-09 ***
## wday.L                -3.318e+00  1.335e+00  -2.486 0.013451 *  
## wday.Q                 8.167e-01  1.349e+00   0.605 0.545476    
## wday.C                 1.688e-01  1.334e+00   0.126 0.899436    
## wday^4                 2.425e+00  1.325e+00   1.831 0.068095 .  
## wday^5                -2.982e-02  1.342e+00  -0.022 0.982281    
## wday^6                 5.652e-01  1.357e+00   0.416 0.677427    
## basket_count_lag2      3.479e-02  1.355e-02   2.567 0.010708 *  
## category_sold_lag2     2.503e-02  1.712e-02   1.462 0.144673    
## category_visits_lag2  -1.815e-02  4.550e-03  -3.990 8.23e-05 ***
## category_favored_lag2  5.110e-03  8.893e-04   5.747 2.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.212 on 315 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.6609, Adjusted R-squared:  0.6361 
## F-statistic: 26.69 on 23 and 315 DF,  p-value: < 2.2e-16
AIC(m14c)
## [1] 2492.665

Before using the residuals of this model in the ARIMA models, we need to check whether they are stationary. The result of the unit root test shows that the residuals are stationary.

summary(ur.kpss(random2, use.lag = 30))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 30 lags. 
## 
## Value of test-statistic is: 0.0409 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
tsdisplay(random2)

3.2.2.1.1 Alternative 1

Results of the first alternative can be seen below. The linear regression with external variables is used in this model. Several ARIMA parameter combinations were considered; the best ARIMA model on the residuals has one autoregressive term.

summary(d27)
## 
## Call:
## arima(x = random1, order = c(1, 0, 0), include.mean = FALSE)
## 
## Coefficients:
##          ar1
##       0.3269
## s.e.  0.0514
## 
## sigma^2 estimated as 70.44:  log likelihood = -1202.25,  aic = 2408.5
## 
## Training set error measures:
##                        ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -0.003331343 8.392679 6.439086 11.09002 178.1962 0.8022363
##                     ACF1
## Training set 0.005281945
checkresiduals(d27)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0) with zero mean
## Q* = 83.001, df = 63, p-value = 0.04648
## 
## Model df: 1.   Total lags used: 64

As can be seen above, the residuals of the ARIMA model seem to satisfy the model assumptions, although the Ljung-Box test is marginally significant (p ≈ 0.046).

tsdisplay(ts(fakir$sold_count-fakir$Model1Fitted))

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model1Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. The analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not significant, and the errors are distributed with zero mean and constant variance, satisfying the assumptions.

3.2.2.1.2 Alternative 2

In the second alternative, linear regression with trend and month variables is used in the model. Then, an ARIMA model is fitted to the residuals. The chosen ARIMA model has one autoregressive and one moving average term, and its results are as follows.

# ARIMA on residuals
summary(e110)
## 
## Call:
## arima(x = random2, order = c(1, 0, 1), include.mean = FALSE)
## 
## Coefficients:
##          ar1     ma1
##       0.6795  -0.161
## s.e.  0.0727   0.100
## 
## sigma^2 estimated as 93.37:  log likelihood = -1257.55,  aic = 2521.1
## 
## Training set error measures:
##                       ME     RMSE      MAE     MPE     MAPE      MASE
## Training set -0.03742546 9.662789 7.405348 67.1264 226.1449 0.8867285
##                     ACF1
## Training set 0.005533995
checkresiduals(e110)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1) with zero mean
## Q* = 43.691, df = 40, p-value = 0.3175
## 
## Model df: 2.   Total lags used: 42

As can be seen above, the model assumptions for the ARIMA model are satisfied. There is only a slightly significant autocorrelation at lag 20.

tsdisplay(ts(fakir$sold_count-fakir$Model2Fitted))

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model2Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. Analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not significant, and the errors have zero mean and constant variance, satisfying the model assumptions.

3.2.2.1.3 Alternative 3

In the third alternative, linear regression with trend and month variables is used in the model. Then, an ARIMA model is fitted to the residuals. The ARIMA model is chosen with the auto.arima function, which suggested a SARIMA model with one autoregressive, one moving average, and two seasonal autoregressive terms. The results of the model are as follows.

summary(autoarima2)
## Series: random2 
## ARIMA(1,0,1)(2,0,0)[21] with zero mean 
## 
## Coefficients:
##          ar1      ma1    sar1    sar2
##       0.6800  -0.1594  0.0084  0.0172
## s.e.  0.0731   0.0997  0.0572  0.0613
## 
## sigma^2 estimated as 94.44:  log likelihood=-1257.5
## AIC=2525   AICc=2525.17   BIC=2544.15
## 
## Training set error measures:
##                       ME     RMSE      MAE      MPE     MAPE     MASE
## Training set -0.03673997 9.661089 7.398634 64.36632 225.8974 0.558102
##                     ACF1
## Training set 0.005435633
checkresiduals(autoarima2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(2,0,0)[21] with zero mean
## Q* = 43.496, df = 38, p-value = 0.2489
## 
## Model df: 4.   Total lags used: 42

As can be seen above, the model assumptions for the SARIMA model are satisfied. There is only a slightly significant autocorrelation at lag 20.

tsdisplay(ts(fakir$sold_count-fakir$Model3Fitted))

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model3Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. Analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not significant, and the errors have zero mean, satisfying the model assumptions.

3.2.2.2 SARIMA

3.2.2.2.1 Alternative 4

In the second approach, a SARIMA model is constructed directly from the sales data using the auto.arima function. As the series is not stationary, the first difference is taken. The resulting model includes one autoregressive term, one moving average term, and two seasonal autoregressive terms. Summary output of the model can be seen below.

summary(fakir_sarima)
## Series: fakir_ts 
## ARIMA(1,1,1)(2,0,0)[21] 
## 
## Coefficients:
##          ar1      ma1    sar1    sar2
##       0.3975  -0.8122  0.0277  0.0177
## s.e.  0.0914   0.0618  0.0569  0.0620
## 
## sigma^2 estimated as 99.29:  log likelihood=-1262.35
## AIC=2534.7   AICc=2534.88   BIC=2553.84
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.237508 9.891259 7.700547 -12.40813 31.45124 0.5421727
##                      ACF1
## Training set -0.009314196
checkresiduals(fakir_sarima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(2,0,0)[21]
## Q* = 33.103, df = 38, p-value = 0.6951
## 
## Model df: 4.   Total lags used: 42

As can be seen above, the residuals of the SARIMA model seem to satisfy the model assumptions. The error terms are approximately normally distributed with zero mean and constant variance, and their autocorrelations are insignificant. The plot below compares the fitted and actual values of the sales data.

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model4Fitted, color = "fitted"))

3.2.2.2.2 Alternative 5

In this alternative, an ARIMA model is constructed directly from the sales data using the auto.arima function, with no seasonal terms. As the series is not stationary, the first difference is taken. The resulting model includes three moving average terms. Summary output of the model can be seen below.

summary(fakir_sarima2)
## Series: fakir_ts 
## ARIMA(0,1,3) 
## 
## Coefficients:
##           ma1      ma2      ma3
##       -0.4309  -0.1449  -0.0907
## s.e.   0.0539   0.0611   0.0535
## 
## sigma^2 estimated as 99.23:  log likelihood=-1262.73
## AIC=2533.46   AICc=2533.58   BIC=2548.78
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE     MAPE     MASE
## Training set -0.2287261 9.903002 7.713351 -12.2549 31.40819 0.932956
##                     ACF1
## Training set 0.002915125
checkresiduals(fakir_sarima2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,3)
## Q* = 8.4205, df = 7, p-value = 0.297
## 
## Model df: 3.   Total lags used: 10

As can be seen above, the residuals of the ARIMA model seem to satisfy the model assumptions. The error terms are approximately normally distributed with zero mean and constant variance, and their autocorrelations are insignificant. The plot below compares the fitted and actual values of the sales data.

tsdisplay(fakir_sarima2$residuals)

ggplot(fakir, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model5Fitted, color = "fitted"))

3.2.3 Model Evaluation

eval
##           Model  n mean       sd        CV         FBias      MAPE      RMSE
## 1 Alternative 1 15 20.2 7.598872 0.3761818  0.4189800374 0.4934498 13.827284
## 2 Alternative 2 15 20.2 7.598872 0.3761818 -0.0006267828 0.3883138  8.571214
## 3 Alternative 3 15 20.2 7.598872 0.3761818 -0.0008082599 0.3896311  8.623990
## 4 Alternative 4 15 20.2 7.598872 0.3761818  0.1368348898 0.3005775  8.401431
## 5 Alternative 5 15 20.2 7.598872 0.3761818  0.1386393982 0.2904862  8.197339
## 6         Naive 15 20.2 7.598872 0.3761818  0.0594059406 0.3813538  9.556847
##         MAD      MADP     WMAPE
## 1 10.722915 0.5308374 0.5308374
## 2  6.574761 0.3254832 0.3254832
## 3  6.577743 0.3256308 0.3256308
## 4  6.092091 0.3015887 0.3015887
## 5  5.982224 0.2961497 0.2961497
## 6  7.733333 0.3828383 0.3828383

Finally, the models are tested between 28 May 2021 and 11 June 2021. Predictions are made for the day after tomorrow, matching the submission setting. The results of the tests are summarized in the table above along with the Naive forecasts. Alternatives 2, 3, 4 and 5 generate better results than the Naive forecasts, and their results are very close to each other. Thus, to generate forecasts for this project, predictions of these alternatives are considered.
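The error measures in the evaluation table come from the project's own helper (`accu`/`eval`), whose definitions are not shown in this chunk. A plausible reconstruction, consistent with the column names and the reported values (e.g. WMAPE equals MAD divided by the mean of the actuals), is:

```r
# Plausible reconstruction of the evaluation metrics (definitions
# assumed from the column names; the project's accu() may differ,
# e.g. in the sign convention of FBias).
eval_metrics <- function(actual, forecast) {
  error <- actual - forecast
  data.frame(
    n     = length(actual),
    mean  = mean(actual),
    sd    = sd(actual),
    CV    = sd(actual) / mean(actual),
    FBias = sum(error) / sum(actual),
    MAPE  = mean(abs(error) / actual),
    RMSE  = sqrt(mean(error^2)),
    MAD   = mean(abs(error)),
    MADP  = sum(abs(error)) / sum(actual),
    WMAPE = sum(abs(error)) / sum(actual)
  )
}

eval_metrics(actual = c(10, 20, 30), forecast = c(12, 18, 33))
```

Note that under these definitions MADP and WMAPE coincide, which matches the identical MADP and WMAPE columns in the table above.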

3.3 Altınyıldız

The third product to be forecasted is a men's jacket of the brand Altınyıldız. It is the lowest-selling product among those considered. Its current sales are very low, and its price is mostly stable. Also, during the forecasting phase, Father's Day played a role in its sales.

rm(list = ls())
load("C:/Users/alpsr/Desktop/Project/Project Final/mont_environment.RData")

3.3.1 Descriptive Analysis

ggplot(mont, aes(x = event_date)) +
  geom_line(aes(y = sold_count))

The time series of this product's sales shows that, for the majority of the period considered, sales equal zero. During these days, the price data of the product is also unavailable.

ggplot(mont) +
  geom_histogram(aes(x = sold_count))

The histogram of the sales data does not show a normal distribution; on most days sales are equal or close to zero. Such a time series makes forecasting harder, as the data has limited trend and seasonality.

3.3.2 Alternative Models

3.3.2.1 Dynamic Regression

ggpairs(mont[,c(4,17:21)])

In the dynamic regression approach, all variables with clean data, except price, are used with a lag of two days, as we only have access to data from two days before; price data from the previous day is available. Only basket count has a significant correlation with the sales. The first linear regression model considered includes month and day-of-the-week variables, and its results are as follows.

summary(m11)
## 
## Call:
## lm(formula = sold_count ~ month + wday, data = mont)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.479 -0.626  0.061  0.309 44.521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.80136    0.16261   4.928 1.28e-06 ***
## month.L      2.30455    0.56578   4.073 5.74e-05 ***
## month.Q      0.44214    0.56150   0.787 0.431570    
## month.C     -1.80083    0.56084  -3.211 0.001445 ** 
## month^4     -2.63004    0.56416  -4.662 4.46e-06 ***
## month^5     -2.24737    0.56611  -3.970 8.73e-05 ***
## month^6      0.30607    0.57152   0.536 0.592623    
## month^7      2.30074    0.56333   4.084 5.49e-05 ***
## month^8      2.22763    0.56197   3.964 8.94e-05 ***
## month^9      1.89711    0.56578   3.353 0.000887 ***
## month^10     1.61465    0.55785   2.894 0.004037 ** 
## month^11     0.05018    0.55977   0.090 0.928615    
## wday.L      -0.10953    0.43181  -0.254 0.799919    
## wday.Q      -0.94503    0.43128  -2.191 0.029094 *  
## wday.C      -0.07909    0.43058  -0.184 0.854363    
## wday^4       0.48674    0.43008   1.132 0.258511    
## wday^5       0.11062    0.42909   0.258 0.796704    
## wday^6      -0.69422    0.42834  -1.621 0.105977    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.116 on 350 degrees of freedom
## Multiple R-squared:  0.2651, Adjusted R-squared:  0.2294 
## F-statistic: 7.428 on 17 and 350 DF,  p-value: 1.177e-15
AIC(m11)
## [1] 1900.474
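The lag-2 construction mentioned above (using only data available two days before the prediction date) can be sketched as follows; the column names here are illustrative, not the project's own.

```r
# Shift a series down by n positions so that row t holds the value
# observed on day t - n; the first n rows have no past value available.
lag_n <- function(x, n) c(rep(NA, n), head(x, -n))

df <- data.frame(sold = 1:10, basket_count = 101:110)
df$basket_lag2 <- lag_n(df$basket_count, 2)
df$basket_lag2[3:4]   # values observed two days earlier: 101, 102
```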

The second linear regression model considered includes trend, month, and day-of-the-week variables. Its adjusted R-squared value slightly improved, but its AIC value increased. We have not considered any additional variables because their data is either not significant or missing.

summary(m12)
## 
## Call:
## lm(formula = sold_count ~ trend + month + wday, data = mont)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.470 -0.674  0.050  0.302 44.475 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.083413   0.698936   0.119 0.905073    
## trend        0.003900   0.003692   1.056 0.291623    
## month.L      2.931179   0.819761   3.576 0.000399 ***
## month.Q     -0.094471   0.757178  -0.125 0.900780    
## month.C     -2.414068   0.807193  -2.991 0.002981 ** 
## month^4     -2.146912   0.726243  -2.956 0.003327 ** 
## month^5     -2.032574   0.601446  -3.379 0.000808 ***
## month^6     -0.101684   0.689621  -0.147 0.882863    
## month^7      2.303634   0.563243   4.090 5.36e-05 ***
## month^8      2.534905   0.632732   4.006 7.54e-05 ***
## month^9      1.756977   0.581038   3.024 0.002681 ** 
## month^10     1.430542   0.584364   2.448 0.014856 *  
## month^11     0.291745   0.604606   0.483 0.629727    
## wday.L      -0.107170   0.431749  -0.248 0.804107    
## wday.Q      -0.946062   0.431212  -2.194 0.028897 *  
## wday.C      -0.079774   0.430513  -0.185 0.853101    
## wday^4       0.483146   0.430020   1.124 0.261979    
## wday^5       0.108326   0.429021   0.252 0.800807    
## wday^6      -0.694406   0.428268  -1.621 0.105829    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.116 on 349 degrees of freedom
## Multiple R-squared:  0.2675, Adjusted R-squared:  0.2297 
## F-statistic: 7.079 on 18 and 349 DF,  p-value: 2.018e-15
AIC(m12)
## [1] 1901.3
3.3.2.1.1 Alternative 1

In the first alternative, we consider the first linear regression model without any ARIMA on its residuals. Its residuals do not seem to satisfy the model assumptions: there is a high autocorrelation at lag 1, and the residuals fluctuate greatly without a constant variance.

checkresiduals(m11)

## 
##  Breusch-Godfrey test for serial correlation of order up to 21
## 
## data:  Residuals
## LM test = 56.813, df = 21, p-value = 3.83e-05
ggplot(mont, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model1Fitted, color = "fitted"))

3.3.2.1.2 Alternative 2

Before constructing an ARIMA model on the residuals of the first linear regression model, their stationarity is checked. Results of the KPSS unit-root test show that the residuals are stationary.

# ARIMA on residuals
summary(ur.kpss(random1, use.lag = 30))
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 30 lags. 
## 
## Value of test-statistic is: 0.0571 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

After trying different alternatives for the ARIMA model, a model with one autoregressive and two moving average terms is chosen. Its summary is as follows.

summary(d36)
## 
## Call:
## arima(x = random1, order = c(1, 0, 2), include.mean = FALSE)
## 
## Coefficients:
##          ar1      ma1      ma2
##       0.9100  -0.7099  -0.2901
## s.e.  0.0237   0.0540   0.0535
## 
## sigma^2 estimated as 8.316:  log likelihood = -913.4,  aic = 1834.79
## 
## Training set error measures:
##                       ME     RMSE       MAE      MPE     MAPE      MASE
## Training set -0.04223264 2.883762 0.9959756 3.501291 167.0976 0.8709728
##                     ACF1
## Training set 0.003196453
checkresiduals(d36)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,2) with zero mean
## Q* = 18.285, df = 7, p-value = 0.01075
## 
## Model df: 3.   Total lags used: 10

The residuals do not seem to satisfy the model assumptions, as the Ljung-Box test rejects white noise at the 5% level. The model is unable to capture the extreme observations.

tsdisplay(ts(mont$sold_count-mont$Model2Fitted))

ggplot(mont, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model2Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. Analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not highly significant.

3.3.2.1.3 Alternative 3

A second ARIMA model is fitted to the residuals of the first linear regression. It has only one moving average term.

summary(autoarima1)
## Series: random1 
## ARIMA(0,0,1) with zero mean 
## 
## Coefficients:
##          ma1
##       0.2550
## s.e.  0.0524
## 
## sigma^2 estimated as 8.725:  log likelihood=-920.28
## AIC=1844.56   AICc=1844.59   BIC=1852.38
## 
## Training set error measures:
##                         ME    RMSE       MAE      MPE     MAPE     MASE
## Training set -0.0002290315 2.94978 0.9965167 60.59157 135.3301 0.871446
##                      ACF1
## Training set -0.006791424
checkresiduals(autoarima1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with zero mean
## Q* = 22.341, df = 9, p-value = 0.007858
## 
## Model df: 1.   Total lags used: 10

The model is still unable to detect the extreme observations.

tsdisplay(ts(mont$sold_count-mont$Model3Fitted))

ggplot(mont, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model3Fitted, color = "fitted"))

Finally, the linear regression and ARIMA models are combined. Analysis of the residuals and fitted values can be seen above. The autocorrelations of the errors are not highly significant.

3.3.2.2 SARIMAX Model

3.3.2.2.1 Alternative 4

In this approach, a SARIMAX model is constructed using the auto.arima function, with basket count and category sold count (lag 2) included as external regressors. The resulting model includes two autoregressive and two moving average terms. The summary of the model can be seen below.

summary(mont_sarimax) # regressors: basket count and category sold count
## Series: mont_ts[3:length(mont_ts)] 
## Regression with ARIMA(2,0,2) errors 
## 
## Coefficients:
##          ar1     ar2      ma1      ma2   xreg1    xreg2
##       0.4574  0.5065  -0.1079  -0.7246  0.0326  -0.0011
## s.e.  0.0875  0.0828   0.0654   0.0540  0.0130   0.0008
## 
## sigma^2 estimated as 9.85:  log likelihood=-935.28
## AIC=1884.56   AICc=1884.87   BIC=1911.88
## 
## Training set error measures:
##                     ME     RMSE       MAE MPE MAPE      MASE       ACF1
## Training set 0.1856612 3.112641 0.7563034 NaN  Inf 0.9421527 0.01715957

The model assumptions about the residuals are not satisfied: the residuals are not normally distributed, and the model is unable to predict the extreme observations accurately.

checkresiduals(mont_sarimax)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,0,2) errors
## Q* = 19.001, df = 4, p-value = 0.0007857
## 
## Model df: 6.   Total lags used: 10
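The regression-with-ARIMA-errors idea behind this SARIMAX fit can be sketched with base R's arima() and its xreg argument. The data here are synthetic stand-ins for the basket-count and category-sold regressors, and a simpler ARMA(1,1) error structure is used for illustration.

```r
# Sketch of a regression-with-ARIMA-errors fit (synthetic data; a
# simplified ARMA(1,1) error term instead of the report's ARMA(2,2)).
set.seed(2)
n        <- 150
basket   <- rpois(n, 20)
category <- rpois(n, 200)
sales    <- 0.3 * basket + 0.01 * category +
            as.numeric(arima.sim(list(ar = 0.5), n))

fit <- arima(sales, order = c(1, 0, 1), xreg = cbind(basket, category))

# Forecasting requires future values of the regressors, which is why
# the report uses their lag-2 versions: those are already known two
# days ahead of the target date.
pred <- predict(fit, n.ahead = 2,
                newxreg = cbind(basket = c(20, 20), category = c(200, 200)))
pred$pred
```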

Below, you can see the comparison of fitted and real values of the sales data.

ggplot(mont, aes(x = event_date))+
  geom_line(aes(y = sold_count, color = "sold_count"))+
  geom_line(aes(y = Model4Fitted, color = "fitted"))

3.3.3 Model Evaluation

eval
##   n     mean       sd       CV      FBias MAPE     RMSE      MAD      MADP
## 1 7 1.285714 1.253566 0.974996 -0.3410640  Inf 1.240657 1.039396 0.8084194
## 2 7 1.285714 1.253566 0.974996  0.5676211  Inf 1.370965 1.206298 0.9382316
## 3 7 1.285714 1.253566 0.974996 -0.3410640  Inf 1.240657 1.039396 0.8084194
## 4 7 1.285714 1.253566 0.974996  0.6465205  Inf 1.427550 1.220789 0.9495029
## 5 7 1.285714 1.253566 0.974996 -0.5555556  NaN 1.647509 1.285714 1.0000000
##       WMAPE
## 1 0.8084194
## 2 0.9382316
## 3 0.8084194
## 4 0.9495029
## 5 1.0000000
rm(list = ls())

Finally, the models are tested between 28 May 2021 and 11 June 2021. Predictions are made for the day after tomorrow. The results of the tests are summarized in the table above along with the Naive forecasts (the last row). All models perform better than the Naive forecasts, with Alternatives 1 and 3 performing best.

To generate forecasts for this project, predictions of Alternatives 2 and 3 are considered.

data = get_data(token=token,url=subm_url)

rawdata <- read.csv("C:/Users/alpsr/Desktop/Project/Project Final/ProjectRawData.csv")

rawdata <- na.omit(rawdata,event_date)

fetched_data_filtered <- data %>%
  mutate(event_date=as.Date(event_date)) %>% 
  filter(event_date>"2021-05-31") %>%
  select(colnames(rawdata)) %>%
  arrange(event_date)

updated_data <- rbind.data.frame(fetched_data_filtered,rawdata)

updated_data <- updated_data %>%
  filter(event_date<="2021-06-11") # we selected "2021-05-28"-"2021-06-11" as our test period

3.4 Oral B - Sarj Edilebilir Dis Fircasi

dis_fircasi <- updated_data %>%
  mutate(event_date=as.Date(event_date)) %>%
  arrange(event_date) %>%
  filter(product_content_id==32939029) 
ggplot(dis_fircasi, aes(x=event_date))+
  geom_line(aes(y=sold_count)) +
  labs(title="Sales of Oral B over time")

The variance changes over time, and there are periods when it is higher than the general level of the data. So I decided to take the logarithm of sold_count and use it while building the models.

dis_fircasi[,log_sold:=log(sold_count)]
acf(dis_fircasi$sold_count, na.action = na.pass)

pacf(dis_fircasi$sold_count, na.action = na.pass)

We see a relatively high autocorrelation at lag 7, which indicates weekly seasonality.

While making predictions, we checked the prices of the products on the website in order to have more up-to-date information.

ggplot(dis_fircasi, aes(x=event_date))+
  geom_line(aes(y=log_sold))+
  labs(title="log(Sales) of Oral B over time")

# replace zero log-sales values with the mean of nearby observations
w <- which(dis_fircasi$log_sold==0)
dis_fircasi$log_sold[w] <- mean(c(dis_fircasi$log_sold[31],dis_fircasi$log_sold[34]))
plot(dis_fircasi$price,dis_fircasi$log_sold)

Although we do not see a clear negative correlation, we can say that the price of the toothbrush affects the sales: when the price is below its general level, sales increase. I decided to use price as a regressor because we have the opportunity to check the price from the website. Using the other variables in the dataset as regressors is problematic because we would also have to forecast those variables, or use lagged values, which may decrease the performance of the model.

3.4.1 Model 1 - Time Series Decomposition & ARIMA on Random Component

In the ACF plot, there is autocorrelation at lag 7, so time series decomposition might be suitable for this series.
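The decomposition-based forecast used in the test loop below can be sketched on a synthetic weekly series: fit an ARIMA model to the random component, then rebuild the forecast by adding back the matching seasonal term and the last available trend estimate.

```r
# Sketch of forecasting via additive decomposition (synthetic weekly
# series; mirrors the reconstruction in the test loop).
set.seed(3)
x  <- ts(10 + sin(2 * pi * (1:140) / 7) + rnorm(140, sd = 0.2), frequency = 7)
dc <- decompose(x, type = "additive")

fit     <- arima(dc$random, order = c(1, 0, 0))   # arima() tolerates the NAs
rand_fc <- predict(fit, n.ahead = 2)$pred[2]      # random component at t + 2

seas_fc  <- dc$seasonal[(length(x) + 2 - 1) %% 7 + 1]  # seasonal term for t + 2
trend_fc <- tail(na.omit(dc$trend), 1)                 # last available trend value

fc <- as.numeric(rand_fc + seas_fc + trend_fc)
fc
```

Because the series is modeled on the log scale, such a forecast is mapped back to sales with exp(), as in the pred1/pred2 columns later.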

The code to test our model:

test_dates <- c(as.Date("2021-05-28"):as.Date("2021-06-11")) #Update here
for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- dis_fircasi[event_date<=current_date,]
  
  dis_fircasi_ts <- ts(past_data$log_sold, frequency = 7)  
  dis_fircasi_decomposed <- decompose(dis_fircasi_ts, type="additive")
  model <- arima(dis_fircasi_decomposed$random,order = c(2,1,2),seasonal = list(order=c(1,0,0)),xreg = past_data$price)
  forecasted=predict(model,n.ahead = 2,newxreg = dis_fircasi[event_date==test_dates[i],price])
  dis_fircasi[nrow(dis_fircasi)-length(test_dates)+i, Model1 := forecasted$pred[2]+dis_fircasi_decomposed$seasonal[(nrow(dis_fircasi)-length(test_dates)+i)%%7+7]+dis_fircasi_decomposed$trend[max(which(!is.na(dis_fircasi_decomposed$trend)))]]
}
m<-accu(dis_fircasi$sold_count[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))],exp(dis_fircasi$Model1[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))]))

3.4.2 Model 2 - Time Series Decomposition & ARIMA on Random Component

The code to test the model:

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- dis_fircasi[event_date<=current_date,]
    
    dis_fircasi_ts <- ts(past_data$log_sold, frequency = 7)  
    dis_fircasi_decomposed <- decompose(dis_fircasi_ts, type="additive")
    model <- arima(dis_fircasi_decomposed$random,order = c(1,0,0),seasonal = list(order=c(1,0,0)),xreg = past_data$price)
    forecasted=predict(model,n.ahead = 2,newxreg = dis_fircasi[event_date==test_dates[i],price])
    dis_fircasi[nrow(dis_fircasi)-length(test_dates)+i, Model2 := forecasted$pred[2]+dis_fircasi_decomposed$seasonal[(nrow(dis_fircasi)-length(test_dates)+i)%%7+7]+dis_fircasi_decomposed$trend[max(which(!is.na(dis_fircasi_decomposed$trend)))]]
  }
  m2<-accu(dis_fircasi$sold_count[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))],exp(dis_fircasi$Model2[(nrow(dis_fircasi)+1-length(test_dates)):(nrow(dis_fircasi))]))  
dis_fircasi[,pred1:=exp(Model1)]
dis_fircasi[,pred2:=exp(Model2)]

3.5 Sleepy Bebek Islak Mendil

bebek_mendil <- updated_data %>%
  mutate(event_date=as.Date(event_date)) %>%
  arrange(event_date) %>%
  filter(product_content_id==4066298)
ggplot(bebek_mendil, aes(x=event_date))+
  geom_line(aes(y=sold_count))+
  labs(title="Sales of Sleepy over time")

The variance increases over time, so again I take the logarithm of sold_count.

acf(bebek_mendil$sold_count, na.action = na.pass)

pacf(bebek_mendil$sold_count, na.action = na.pass)

There is no significant seasonality, so autoregressive models may be used.

bebek_mendil[,log_sold:=log(sold_count)]
bebek_mendil[,lagged:=lag(log_sold)]
bebek_mendil[,popular_period:=ifelse(ty_visits>130000000,1,0)] 
ggplot(bebek_mendil, aes(x=event_date))+
  geom_line(aes(y=log_sold))+
  labs(title="log(Sales) of Sleepy over time")

We see neither a clear trend nor seasonality.

bebek_mendil$log_sold %>%
  ur.kpss() %>%
  summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.6258 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic (0.63) exceeds the 5% critical value (0.46), so we have evidence against the stationarity of the data.
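Since the KPSS test rejects level stationarity, the ARIMA model below uses first differencing (the d = 1 in its (0,1,5) order). The idea can be illustrated on a simulated random walk, the textbook non-stationary series:

```r
# A random walk is non-stationary in level, but its first difference
# recovers the underlying white noise.
set.seed(4)
rw <- cumsum(rnorm(300))   # random walk: variance grows with time
d1 <- diff(rw)             # first difference: back to white noise
c(var(rw), var(d1))        # level variance is typically far larger
```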

plot(bebek_mendil$price,bebek_mendil$log_sold)

We see that price is negatively correlated with log(sales). Regardless of the type of model we use, price should be a regressor.

3.5.1 Model 1 - ARIMA with external regressors

Although the data is not stationary, I thought that an ARIMA model with regressors would work well for this series, with the first difference (d = 1) handling the non-stationarity.

Code to test the model:

test_dates <- c(as.Date("2021-05-28"):as.Date("2021-06-11")) #Update here

for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- bebek_mendil[event_date<=current_date,]
  fitted_arima=arima(past_data$log_sold, xreg = past_data$price,order=c(0,1,5))
  forecasted=predict(fitted_arima,newxreg = bebek_mendil[event_date==test_dates[i],price],n.ahead = 2)
  bebek_mendil[nrow(bebek_mendil)-length(test_dates)+i, Model1 := forecasted$pred[2]]
}

s<-accu(bebek_mendil$sold_count[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))],exp(bebek_mendil$Model1[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))]))

3.5.2 Model 2 - Linear regression & ARIMA on residuals

I tried these linear regression models:

lm(log_sold~lagged,bebek_mendil)
lm(log_sold~lagged+price,bebek_mendil)
lm(log_sold~lagged+price+ty_visits,bebek_mendil)
lm(log_sold~lagged+price+ty_visits+as.factor(month(event_date)),bebek_mendil)
lm(log_sold~lagged+price+ty_visits+as.factor(month(event_date))+as.factor(weekdays(event_date)))
lm(log_sold~lagged+price+as.factor(month(event_date))+as.factor(weekdays(event_date)))

And decided to use this:

lm(log_sold~lagged+price+as.factor(month(event_date))+as.factor(weekdays(event_date)))

Code to test model:

test_dates <- c(as.Date("2021-05-28"):as.Date("2021-06-11")) #Update here
fmla <- "log_sold~lagged+price+as.factor(month(event_date))+as.factor(weekdays(event_date))"
for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- bebek_mendil[event_date<=current_date,]
  fitted_lm=lm(as.formula(fmla),past_data)
  rm <- arima(fitted_lm$residual, order = c(1,0,0))
  forecasted=predict(fitted_lm,bebek_mendil[event_date == test_dates[i],])
  bebek_mendil[nrow(bebek_mendil)-length(test_dates)+i, Model2 := forecasted+forecast(rm,h=2)$mean[2]]
}

s2 <-accu(bebek_mendil$sold_count[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))],exp(bebek_mendil$Model2[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))])) 

3.5.3 Model 3 - Linear regression & ARIMA on residuals

After making predictions for a few days, we observed that our model did not capture the effect of increased visits during popular periods on the website. So we added the popular_period variable to account for this effect.

Code to test the model:

fmla2 <- "log_sold~lagged+price+as.factor(month(event_date))+as.factor(weekdays(event_date))+as.factor(popular_period)"
  for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- bebek_mendil[event_date<=current_date,]
    fitted_lm=lm(as.formula(fmla2),past_data)
    rm <- arima(fitted_lm$residual, order = c(1,0,0))
    forecasted=predict(fitted_lm,bebek_mendil[event_date == test_dates[i],])
    bebek_mendil[nrow(bebek_mendil)-length(test_dates)+i, Model3 := forecasted+forecast(rm,h=2)$mean[2]]
  }
  
  s3 <-accu(bebek_mendil$sold_count[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))],exp(bebek_mendil$Model3[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))]))

3.5.4 Model 4 - ARIMA with external regressors

Code to test the model:

  for(i in 1:length(test_dates)){

    current_date=test_dates[i]-2

    past_data <- bebek_mendil[event_date<=current_date,]
    fitted_arima=arima(past_data$log_sold, xreg = cbind(past_data$price,past_data$popular_period),order=c(0,1,5))
    forecasted=predict(fitted_arima,newxreg = cbind(bebek_mendil[event_date==test_dates[i],price],bebek_mendil[event_date==test_dates[i],popular_period]),n.ahead = 2)
    bebek_mendil[nrow(bebek_mendil)-length(test_dates)+i, Model4 := forecasted$pred[2]]
  }

    s4<-accu(bebek_mendil$sold_count[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))],exp(bebek_mendil$Model4[(nrow(bebek_mendil)+1-length(test_dates)):(nrow(bebek_mendil))]))
bebek_mendil[,pred1:=exp(Model1)]
bebek_mendil[,pred2:=exp(Model2)]
bebek_mendil[,pred3:=exp(Model3)]
bebek_mendil[,pred4:=exp(Model4)]

3.6 La Roche Posay Temizleme Jeli

yuz_temizleyici <- updated_data %>%
  mutate(event_date=as.Date(event_date)) %>%
  arrange(event_date) %>%
  filter(product_content_id==85004)
ggplot(yuz_temizleyici, aes(x=event_date))+
  geom_line(aes(y=sold_count))+
  labs(title="Sales of La Roche over time")

Again, we see periods in which the variance increases significantly, so it is better to use logged values.

acf(yuz_temizleyici$sold_count, na.action = na.pass)

Interestingly, we see a relatively high autocorrelation at lag 15. If we were to use time series decomposition, we could set the frequency to 15.

yuz_temizleyici[,log_sold:=log(sold_count)]
ggplot(yuz_temizleyici, aes(x=event_date))+
  geom_line(aes(y=log_sold))+
  labs(title="log(Sales) of La Roche over time")

3.6.1 Model 1 - ARIMA with external regressors

Code to test the model:

test_dates <- c(as.Date("2021-05-28"):as.Date("2021-06-11"))


for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- yuz_temizleyici[event_date<=current_date,]
  fitted_arima=arima(past_data$log_sold, xreg = past_data$price,order=c(2,1,4))
  forecasted=predict(fitted_arima,newxreg = yuz_temizleyici[event_date==test_dates[i],price],n.ahead = 2)
  yuz_temizleyici[nrow(yuz_temizleyici)-length(test_dates)+i, Model1 := forecasted$pred[2]]
} 

t<-accu(yuz_temizleyici$sold_count[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))],exp(yuz_temizleyici$Model1[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))]))

3.6.2 Model 2 - ARIMA with external regressors

In this model, a lag-15 difference is taken before fitting an ARIMA model.

Code to test model:

for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- yuz_temizleyici[event_date<=current_date,]
  fitted_arima=arima(diff(past_data$log_sold,15),order=c(1,0,0),seasonal = c(2,0,3)) 
  forecasted=predict(fitted_arima,n.ahead = 2)
  yuz_temizleyici[nrow(yuz_temizleyici)-length(test_dates)+i,Model2:=(yuz_temizleyici[nrow(yuz_temizleyici)-length(test_dates)+i-15,log_sold]+forecasted$pred[2])]
}

t2<-accu(yuz_temizleyici$sold_count[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))],exp(yuz_temizleyici$Model2[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))]))

3.6.3 Model 3 - Time Series Decomposition & ARIMA on random component

Code to test model:

for(i in 1:length(test_dates)){
  
  current_date=test_dates[i]-2
  
  past_data <- yuz_temizleyici[event_date<=current_date,]
  
  yuz_temizleyici_ts <- ts(past_data$log_sold, frequency = 15)
  yuz_temizleyici_decomposed <- decompose(yuz_temizleyici_ts,type = "additive")
  model <- arima(yuz_temizleyici_decomposed$random,order = c(3,0,2))
  forecasted=predict(model,n.ahead = 2)
  yuz_temizleyici[nrow(yuz_temizleyici)-length(test_dates)+i,Model3:=forecasted$pred[2]+yuz_temizleyici_decomposed$seasonal[(nrow(yuz_temizleyici)-length(test_dates)+i)%%15+15]+yuz_temizleyici_decomposed$trend[max(which(!is.na(yuz_temizleyici_decomposed$trend)))]]
}

t3<-accu(yuz_temizleyici$sold_count[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))],exp(yuz_temizleyici$Model3[(nrow(yuz_temizleyici)+1-length(test_dates)):(nrow(yuz_temizleyici))]))
yuz_temizleyici[,pred1:=exp(Model1)]
yuz_temizleyici[,pred2:=exp(Model2)]
yuz_temizleyici[,pred3:=exp(Model3)]

3.7 Results

ggplot(dis_fircasi,aes(x=event_date))+
  geom_line(aes(y=exp(Model1)), color="red")+
  xlim(as.Date("2021-05-28"),as.Date("2021-06-11"))+ #Update here
  geom_line(aes(y=sold_count))+
  geom_line(aes(y=exp(Model2)),color="blue")+
  labs(title="Oral B - Actual vs. Predicted over time")

rbind(c("Model 1", m),c("Model 2",m2))
##                n  mean     sd       CV        FBias      MAPE      RMSE    
## [1,] "Model 1" 15 149.4667 68.00826 0.4550062 -0.1106531 0.4136263 64.55335
## [2,] "Model 2" 15 149.4667 68.00826 0.4550062 0.08752205 0.2002466 52.63882
##      MAD      MADP      WMAPE    
## [1,] 55.23642 0.3695568 0.3695568
## [2,] 33.28247 0.2226749 0.2226749

Neither model captures the sudden increases, but in the other periods Model 2 in particular performs quite well.

ggplot(bebek_mendil,aes(x=event_date))+
  geom_line(aes(y=exp(Model1)), color="red")+
  xlim(as.Date("2021-05-28"),as.Date("2021-06-11"))+ #Update here
  geom_line(aes(y=sold_count))+
  geom_line(aes(y=exp(Model2)),color="blue")+
  geom_line(aes(y=exp(Model3)),color="green")+
  geom_line(aes(y=exp(Model4)),color="purple")+
  labs(title="Sleepy - Actual vs. Predicted over time")

rbind(c("Model 1", s),c("Model 2",s2),c("Model 3", s3),c("Model 4",s4))
##                n  mean     sd       CV        FBias       MAPE      RMSE    
## [1,] "Model 1" 15 542.1333 323.7585 0.5971935 0.01383865  0.2504832 218.5173
## [2,] "Model 2" 15 542.1333 323.7585 0.5971935 -0.05642248 0.2875008 224.6735
## [3,] "Model 3" 15 542.1333 323.7585 0.5971935 -0.04968033 0.2833153 214.2822
## [4,] "Model 4" 15 542.1333 323.7585 0.5971935 0.03450548  0.2351071 230.0444
##      MAD      MADP      WMAPE    
## [1,] 146.4193 0.27008   0.27008  
## [2,] 166.6595 0.3074142 0.3074142
## [3,] 160.6918 0.2964064 0.2964064
## [4,] 148.0539 0.273095  0.273095
ggplot(yuz_temizleyici,aes(x=event_date))+
  geom_line(aes(y=exp(Model1)), color="red")+
  xlim(as.Date("2021-05-28"),as.Date("2021-06-11"))+ #Update here
  geom_line(aes(y=sold_count))+
  geom_line(aes(y=exp(Model2)),color="blue")+
  geom_line(aes(y=exp(Model3)),color="green")+
  labs(title="La Roche - Actual vs. Predicted over time")

rbind(c("Model 1", t),c("Model 2",t2),c("Model 3", t3))
##                n  mean     sd       CV        FBias       MAPE      RMSE    
## [1,] "Model 1" 15 81.26667 18.35159 0.2258194 0.006983246 0.2617224 26.67892
## [2,] "Model 2" 15 81.26667 18.35159 0.2258194 -0.1644804  0.4585515 42.18491
## [3,] "Model 3" 15 81.26667 18.35159 0.2258194 -0.1602668  0.3232097 30.79468
##      MAD      MADP      WMAPE    
## [1,] 21.03962 0.2588961 0.2588961
## [2,] 33.25411 0.4091974 0.4091974
## [3,] 23.35136 0.2873424 0.2873424

3.8 Leopard Skin Bikini

## 
## -- Column specification --------------------------------------------------------
## cols(
##   event_date = col_date(format = ""),
##   product_content_id = col_double(),
##   price = col_double(),
##   sold_count = col_double(),
##   visit_count = col_double(),
##   basket_count = col_double(),
##   favored_count = col_double(),
##   category_sold = col_double(),
##   category_visits = col_double(),
##   category_basket = col_double(),
##   category_favored = col_double(),
##   category_brand_sold = col_double(),
##   ty_visits = col_double()
## )

Editing the data:

raw_data <- data.table(raw_data)
raw_data[, "event_date" := as.Date(event_date)]
raw_data <- raw_data[event_date >= '2020-05-25']
raw_data[, month := month(event_date, label = TRUE)]
raw_data[, wday := wday(event_date, label = TRUE)]
raw_data <- raw_data[order(event_date),]

The leopard skin bikini is the next product. Given the nature of swimsuits, we expect them to sell more as summer approaches; during winter, people do not tend to buy bikinis as much. To understand this product, we examine Trendyol's website and the positioning of this specific bikini top. Although the positioning changed during the period, it is a semi-popular model on Trendyol. There is an alternative to this bikini on the website: the same top in a floral print. We also examine the product page and check for campaigns regularly. In our first observation we noticed that the model was partially out of stock, with only sizes 40 and 42 remaining.

Reading the data:

data_leopar <- raw_data[product_content_id == 73318567] #leopar

To understand the bikini's sales trend better, the time series of the number of items sold should be examined.

ggplot(data_leopar, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + ggtitle("Sales of Leopard Bikini") + xlab("day") + ylab("Sales")

There are many missing points in the data, and a pattern we certainly would not expect for a bikini: no sales at all during April. This is unexpected and can distort our results. We believe it is due to a stockout or a planned decision, so we omit that period and focus on the data after April, in order to understand the trend without being affected by this bizarre situation.

To have a general understanding, we also check the autocorrelation and partial autocorrelation.

acf(data_leopar$sold_count)

pacf(data_leopar$sold_count)

We detect the general trend: each observation depends on the previous one. However, there is no significant day-of-week pattern at first glance.

In the models we will use the regressors in various ways; since future values of the regressors are unknown at prediction time, we use lagged variables. To decide which variables to lag, we need to analyze the correlation plot. We then trim the data to the period after April, so that the lag shifting does not cost us any additional observations.

leopar <- data_leopar
numeric_data <- leopar
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   122 obs. of  11 variables:
##  $ price              : num  60 60 60 60 60 ...
##  $ sold_count         : num  1 2 2 2 1 1 5 3 6 1 ...
##  $ visit_count        : num  0 0 0 0 0 261 306 343 334 278 ...
##  $ basket_count       : num  4 11 13 7 2 7 25 28 35 17 ...
##  $ favored_count      : num  73 62 110 45 37 49 71 76 45 51 ...
##  $ category_sold      : num  124 156 219 148 134 175 457 372 360 229 ...
##  $ category_visits    : num  389 495 453 391 352 458 781 608 588 350 ...
##  $ category_basket    : num  0 0 0 0 0 ...
##  $ category_favored   : num  3246 3956 3427 2601 2167 ...
##  $ category_brand_sold: num  14526 13797 20377 8414 7112 ...
##  $ ty_visits          : num  1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

Sold count correlates most strongly with basket_count, visit_count, and favored_count, so we add their lagged versions and examine their correlations as well. We can then trim the data and check the combinations.

leopar <- leopar[,favored_count_lag := shift(favored_count, 2)]
leopar <- leopar[,basket_count_lag := shift(basket_count, 2)]
leopar <- leopar[,visit_count_lag := shift(visit_count, 2)]

leopar_reduced <- leopar[event_date >= '2021-04-29'] 

numeric_data <- leopar_reduced
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
## Warning in set(x, j = name, value = value): Column 'trend' does not exist to
## remove
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   64 obs. of  14 variables:
##  $ price              : num  60 60 60 60 60 ...
##  $ sold_count         : num  96 89 104 127 126 141 117 135 133 143 ...
##  $ visit_count        : num  3297 5122 9811 17089 13802 ...
##  $ basket_count       : num  316 347 555 688 632 542 507 537 540 638 ...
##  $ favored_count      : num  280 493 913 1982 1344 ...
##  $ category_sold      : num  2362 2367 2594 3063 3489 ...
##  $ category_visits    : num  3442 3849 3989 5728 9675 ...
##  $ category_basket    : num  412303 513576 654038 941317 1214826 ...
##  $ category_favored   : num  17204 21334 26379 35354 53824 ...
##  $ category_brand_sold: num  39275 49407 64542 102490 119304 ...
##  $ ty_visits          : num  1.65e+08 1.66e+08 1.67e+08 1.69e+08 1.69e+08 ...
##  $ favored_count_lag  : num  28 227 280 493 913 ...
##  $ basket_count_lag   : num  0 303 316 347 555 688 632 542 507 537 ...
##  $ visit_count_lag    : num  283 2657 3297 5122 9811 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

Below is the time series of the trimmed data. Price is generally an important factor, yet we did not see its effect in the correlations, so we also plot the price.

ggplot(leopar_reduced, aes(x=event_date)) + 
  geom_line(aes(y = sold_count, color = "Sales")) + 
  geom_line(aes(y = price, color="Price")) + scale_x_date(date_breaks = "1 day", date_labels = "%d") + ggtitle("Sales of Leopard Bikini") + xlab("day") + ylab("Sales")

The price does not change over time, and as we examined the listing day by day we saw that not every kind of pricing is reflected in the price column; basket discounts, for example, are not included in the price. We therefore believe the effect of price is not captured in the data.

We also observe a very interesting pattern here: starting from June, sales drop drastically, which is unexpected as summer arrives. We observed on the website that the product is currently out of stock, so we assume the stockout started in June and is the cause of the low sales. We added this information to the model and built our first model with linear regressors; after building the model with regressors, we fit an ARIMA to its residuals.

3.8.1 Model with Dynamic Regression

Adding the stock out factor and using it in a model:

leopar_reduced <- leopar_reduced[,stock_out := ifelse(sold_count < 75, 1, 0)]

leopar_reduced$stock_out <- as.factor(leopar_reduced$stock_out)

model1=lm(sold_count~ stock_out, leopar_reduced)

summary(model1)
## 
## Call:
## lm(formula = sold_count ~ stock_out, data = leopar_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.472 -21.472  -5.933  19.837 131.528 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  154.472      6.260   24.68   <2e-16 ***
## stock_out1  -115.079      9.464  -12.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.56 on 62 degrees of freedom
## Multiple R-squared:  0.7045, Adjusted R-squared:  0.6998 
## F-statistic: 147.8 on 1 and 62 DF,  p-value: < 2.2e-16

The adjusted R-squared is already reasonable; let's add the correlated lagged variables one by one to see whether they improve the model.

model1=lm(sold_count~ stock_out + basket_count_lag, leopar_reduced)

summary(model1)
## 
## Call:
## lm(formula = sold_count ~ stock_out + basket_count_lag, data = leopar_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.738 -19.619  -5.261  11.234 121.265 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      110.43987   13.93404   7.926  5.9e-11 ***
## stock_out1       -85.68504   12.15428  -7.050  1.9e-09 ***
## basket_count_lag   0.06549    0.01887   3.471 0.000957 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.6 on 61 degrees of freedom
## Multiple R-squared:  0.7533, Adjusted R-squared:  0.7452 
## F-statistic: 93.12 on 2 and 61 DF,  p-value: < 2.2e-16

The adjusted R-squared increased. Adding the second:

model1=lm(sold_count~ stock_out + basket_count_lag + visit_count_lag, leopar_reduced)

summary(model1)
## 
## Call:
## lm(formula = sold_count ~ stock_out + basket_count_lag + visit_count_lag, 
##     data = leopar_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.603 -19.576  -3.327  13.796 116.187 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      114.263203  14.075033   8.118 3.08e-11 ***
## stock_out1       -96.249470  14.146669  -6.804 5.40e-09 ***
## basket_count_lag   0.002515   0.047957   0.052    0.958    
## visit_count_lag    0.002569   0.001801   1.426    0.159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.31 on 60 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7494 
## F-statistic: 63.81 on 3 and 60 DF,  p-value: < 2.2e-16

Again, the adjusted R-squared has increased. Adding the third:

model1=lm(sold_count ~ stock_out + basket_count_lag + visit_count_lag + favored_count_lag, leopar_reduced)
summary(model1)
## 
## Call:
## lm(formula = sold_count ~ stock_out + basket_count_lag + visit_count_lag + 
##     favored_count_lag, data = leopar_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.358 -18.585  -2.672  14.267  73.436 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.552e+01  1.271e+01   7.515 3.63e-10 ***
## stock_out1        -1.027e+02  1.222e+01  -8.408 1.12e-11 ***
## basket_count_lag   1.130e-02  4.120e-02   0.274    0.785    
## visit_count_lag    1.266e-02  2.633e-03   4.810 1.08e-05 ***
## favored_count_lag -9.896e-02  2.089e-02  -4.738 1.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.45 on 59 degrees of freedom
## Multiple R-squared:  0.8271, Adjusted R-squared:  0.8154 
## F-statistic: 70.58 on 4 and 59 DF,  p-value: < 2.2e-16

This also improved the model. Let's add wday to see whether the day of the week improves the model.

leopar_reduced$wday <- as.factor(leopar_reduced$wday)
model=lm(sold_count ~ stock_out + basket_count_lag + visit_count_lag + favored_count_lag + wday, leopar_reduced)

summary(model)
## 
## Call:
## lm(formula = sold_count ~ stock_out + basket_count_lag + visit_count_lag + 
##     favored_count_lag + wday, data = leopar_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.510 -16.128  -3.915  19.673  64.444 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.663e+01  1.310e+01   7.374 1.13e-09 ***
## stock_out1        -1.051e+02  1.294e+01  -8.121 7.17e-11 ***
## basket_count_lag  -4.650e-03  4.478e-02  -0.104  0.91769    
## visit_count_lag    1.300e-02  2.838e-03   4.581 2.85e-05 ***
## favored_count_lag -9.549e-02  2.313e-02  -4.128  0.00013 ***
## wday.L            -8.030e+00  1.030e+01  -0.780  0.43908    
## wday.Q             2.187e+01  1.038e+01   2.107  0.03988 *  
## wday.C             4.513e+00  1.052e+01   0.429  0.66958    
## wday^4             4.551e+00  1.017e+01   0.447  0.65647    
## wday^5             3.359e+00  9.801e+00   0.343  0.73314    
## wday^6            -1.557e+00  1.016e+01  -0.153  0.87879    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.43 on 53 degrees of freedom
## Multiple R-squared:  0.8449, Adjusted R-squared:  0.8156 
## F-statistic: 28.87 on 10 and 53 DF,  p-value: < 2.2e-16

The adjusted R-squared remains essentially the same, so we will not use wday. We need to check the residuals of our final regression model.

plot(model1$residuals)

checkresiduals(model1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 21.293, df = 10, p-value = 0.01914

From the graph, the residuals look roughly random and approximately normal, although the Breusch-Godfrey test (p = 0.019) suggests some remaining serial correlation, which further motivates fitting an ARIMA to the residuals. We can continue with this methodology.

To build the ARIMA model, we need to examine the autocorrelation and partial autocorrelation of the residuals.

modellastts <- (model1$residuals)
acf(modellastts)

pacf(modellastts)

The autocorrelation plot is decaying. We see significant partial autocorrelations at lags 1 and 4. Let's check auto.arima's suggestion:

auto.arima(modellastts)
## Series: modellastts 
## ARIMA(0,0,1) with zero mean 
## 
## Coefficients:
##          ma1
##       0.4289
## s.e.  0.1479
## 
## sigma^2 estimated as 715.2:  log likelihood=-300.73
## AIC=605.46   AICc=605.66   BIC=609.78
modelarima1 <- arima(modellastts, order=c(1,0,0))
print(modelarima1)
## 
## Call:
## arima(x = modellastts, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.2955    -0.1584
## s.e.  0.1189     4.7587
## 
## sigma^2 estimated as 728.4:  log likelihood = -301.77,  aic = 609.53

Our approach is AR(4), as there is a significant partial autocorrelation at that lag:

modelarima <- arima(modellastts, order=c(4,0,0))
print(modelarima)
## 
## Call:
## arima(x = modellastts, order = c(4, 0, 0))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4  intercept
##       0.3649  -0.2000  0.1702  -0.3013     0.2615
## s.e.  0.1181   0.1255  0.1238   0.1166     3.3230
## 
## sigma^2 estimated as 641.9:  log likelihood = -297.93,  aic = 607.87

This performed worse (higher AIC) than auto.arima's suggestion; let's also add the suggested MA term to the model.

modelarima <- arima(modellastts, order=c(4,0,1))
print(modelarima)
## 
## Call:
## arima(x = modellastts, order = c(4, 0, 1))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1  intercept
##       0.4198  -0.2191  0.1783  -0.3046  -0.0612     0.2874
## s.e.  0.8579   0.3201  0.1753   0.1240   0.9466     3.2861
## 
## sigma^2 estimated as 641.8:  log likelihood = -297.93,  aic = 609.86

The AIC is higher still, so performance has worsened. We will use ARIMA(0,0,1) in our analysis.

predicted1 = modellastts - modelarima1$residuals # ARIMA fitted values of the regression residuals

leopar_reduced = leopar_reduced[, predictedlinear := leopar_reduced$sold_count - modelarima1$residuals] # regression fitted values plus ARIMA-fitted residuals

ggplot(leopar_reduced, aes(x=event_date)) + 
  geom_line(aes(y = sold_count, color = "Sales")) + 
  geom_line(aes(y = predictedlinear, color="Prediction")) + scale_x_date(date_breaks = "1 day", date_labels = "%d")+ ggtitle("Sales of Leopard Bikini") + xlab("day") + ylab("Sales")

This model seems able to capture the trend.

To evaluate performance, we generate predictions over the test period and check the WMAPE.
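The accu helper used here is defined earlier in the report; as a reference, a minimal base-R sketch of the error measures its output reports (the exact formulas are our reading of the output columns, so treat them as assumptions):

```r
# Sketch of an accuracy summary in the spirit of accu(); the definitions
# below are inferred from the report's output columns.
accu_sketch <- function(actual, forecast) {
  n     <- length(actual)
  error <- actual - forecast
  data.frame(
    n     = n,
    mean  = mean(actual),
    sd    = sd(actual),
    CV    = sd(actual) / mean(actual),       # coefficient of variation
    FBias = sum(error) / sum(actual),        # overall forecast bias
    MAPE  = mean(abs(error) / actual),       # mean absolute percentage error
    RMSE  = sqrt(mean(error^2)),
    MAD   = mean(abs(error)),                # mean absolute deviation
    WMAPE = sum(abs(error)) / sum(actual)    # weighted MAPE
  )
}

res <- accu_sketch(actual = c(100, 120, 80), forecast = c(90, 130, 80))
```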

test_start=as.Date('2021-06-11')
test_end=as.Date('2021-06-30')

test_dates=seq(test_start,test_end,by='day')
leopar_reduced$Model1 =NULL
## Warning in set(x, j = name, value = value): Column 'Model1' does not exist to
## remove
for(i in 1:length(test_dates)){
  current_date=test_dates[i]-2
  past_data <- leopar_reduced[event_date<=current_date,]
  fitted_lm= lm(sold_count ~ stock_out, past_data)
  rm <- arima(fitted_lm$residual, order = c(0,0,1))
  forecasted=predict(fitted_lm,leopar_reduced[event_date == test_dates[i],])
  leopar_reduced[nrow(leopar_reduced)-length(test_dates)+i, Model1 := (forecasted+forecast(rm,h=2)$mean[2])]
  
}

m_with_regression<-accu(leopar_reduced$sold_count[(nrow(leopar_reduced)+1-length(test_dates)):nrow(leopar_reduced)],leopar_reduced$Model1[(nrow(leopar_reduced)+1-length(test_dates)):nrow(leopar_reduced)])

3.8.2 ARIMA and ARIMAX Models

For the second model, we first decompose the series and build an ARIMA model on the random component of the decomposition.

Checking the autocorrelation and partial autocorrelation:

acf(leopar_reduced$sold_count, main= "Daily Autocorrelation")

pacf(leopar_reduced$sold_count, main= "Daily Partial Autocorrelation")

The autocorrelation plot indicates a trend, as discussed in the analysis part. Although we did not see a peak at lag seven, a day-of-week pattern is plausible, so we decompose with a frequency of 7. As the variance changes over time, we first make a multiplicative decomposition and, depending on the results, also try an additive one.

leoparts <- ts(leopar_reduced$sold_count,freq=7)
data_mult<-decompose(leoparts,type="multiplicative")
random=data_mult$random
plot(data_mult)

The trend increases as summer arrives; however, there is a sharp decrease in June due to the stockout of the smaller sizes. The random component appears fairly stationary, with nearly stable variance and mean.

In this graph it is hard to identify the day-of-week pattern, so we take a closer look.

plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplicative Day Effect", main = "Seasonal Component for Multiplicative Decomposition")

mean_sold=data_leopar[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday   m_sold
## 1:  Mon 16.10345
## 2:  Tue 17.22414
## 3:  Wed 18.56897
## 4:  Thu 18.41379
## 5:  Fri 17.22807
## 6:  Sat 20.78947
## 7:  Sun 20.50877

To interpret the multiplicative decomposition, we examine the day-of-week multipliers, and to compare them with the data we also examine the mean sales of each day. Saturdays and Sundays have the highest means, and the seasonal component reflects this.

Higher sales on weekends are normal, as people have more time to shop.

To check the stationarity of the detrended part, we use the KPSS test:

unt_test=ur.kpss(data_mult$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.0438 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The series is stationary: the test statistic of 0.0438 is below even the 10% critical value of 0.347, so the corresponding p-value is larger than 10% and we fail to reject the null hypothesis that the series is stationary.

We will also check the additive decomposition.

leoparts <- ts(leopar_reduced$sold_count,freq=7)
data_add<-decompose(leoparts,type="additive")
random1=data_add$random
plot(data_add)

The trend is very similar to that of the multiplicative version.

Again, it is hard to identify the day-of-week pattern in this graph, so we take a closer look.

plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive Day Effect", main = "Seasonal Component for Additive Decomposition")

mean_sold=data_leopar[,list(m_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday   m_sold
## 1:  Mon 16.10345
## 2:  Tue 17.22414
## 3:  Wed 18.56897
## 4:  Thu 18.41379
## 5:  Fri 17.22807
## 6:  Sat 20.78947
## 7:  Sun 20.50877

In the additive model, the day effects are not very similar to the mean sales of each day: the highest points fall on Wednesdays and Thursdays, so this model does not reflect the weekly pattern very well.

To check the stationarity of the detrended part, we again use the KPSS test:

unt_test=ur.kpss(data_add$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 3 lags. 
## 
## Value of test-statistic is: 0.0353 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The series is again stationary: the test statistic of 0.0353 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis of stationarity. However, since the additive seasonal component does not match the observed day-of-week means as well as the multiplicative one, we continue our analysis with the multiplicative model.

To decide on the ARIMA orders, we check the autocorrelation and partial autocorrelation functions.

acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

pacf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

The autocorrelation plot is decaying, and there is a significant partial autocorrelation at lag 3, so AR(3) will be tried first.

model <- arima(random, order=c(3,0,0))
print(model)
## 
## Call:
## arima(x = random, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       0.1665  -0.1494  -0.3194     0.9938
## s.e.  0.1254   0.1246   0.1231     0.0225
## 
## sigma^2 estimated as 0.0483:  log likelihood = 5.35,  aic = -0.7

The partial autocorrelation plot also appears to decay, and there is a significant autocorrelation at lag 3 in the autocorrelation plot. Let's also try MA(3).

modelf <- arima(random, order=c(0,0,3))
print(modelf)
## 
## Call:
## arima(x = random, order = c(0, 0, 3))
## 
## Coefficients:
##           ma1      ma2      ma3  intercept
##       -0.0128  -0.5068  -0.4804     0.9871
## s.e.   0.1491   0.1134   0.1058     0.0038
## 
## sigma^2 estimated as 0.04304:  log likelihood = 7.22,  aic = -4.44

As the AIC is lower than the first model's, this model is better.

We also try their combination and neighbouring orders to decide which model to use.
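Trying the combination and its neighbours can be organized as a small AIC grid search; a sketch on a simulated stand-in series (the grid bounds and the series are illustrative, not the report's actual data):

```r
set.seed(1)
x <- arima.sim(model = list(ma = c(-0.1, -0.5)), n = 120)  # stand-in for the random component

# Fit every (p,0,q) candidate in a small grid and keep the lowest-AIC fit
best <- NULL
for (p in 0:3) {
  for (q in 0:3) {
    fit <- tryCatch(arima(x, order = c(p, 0, q)), error = function(e) NULL)
    if (!is.null(fit) && (is.null(best) || AIC(fit) < AIC(best))) best <- fit
  }
}
best$arma  # orders of the winning candidate
```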

model <- arima(random, order=c(3,0,3))
print(model)
## 
## Call:
## arima(x = random, order = c(3, 0, 3))
## 
## Coefficients:
##           ar1     ar2     ar3     ma1      ma2      ma3  intercept
##       -0.5538  0.3311  0.1961  0.6761  -0.7420  -0.9341     0.9881
## s.e.   0.1562  0.1582  0.1561  0.1247   0.1046   0.1343     0.0049
## 
## sigma^2 estimated as 0.03805:  log likelihood = 9.84,  aic = -3.69

The model has worsened (higher AIC), so we will use ARIMA(0,0,3).

We will use two models: ARIMA alone and ARIMA with regressors.

3.8.2.1 ARIMAX

From the first part, we know which variables can be used as regressors.

leopar_reduced = leopar_reduced[,random := data_mult$random]
reg_matrix = cbind(leopar_reduced$stock_out) # can add more if any other regressors exist
model1 <- arima(leopar_reduced$random, order = c(0,0,3), xreg = reg_matrix)
summary(model1)
## 
## Call:
## arima(x = leopar_reduced$random, order = c(0, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
##           ma1      ma2      ma3  intercept  reg_matrix
##       -0.0362  -0.5017  -0.4620     1.0174     -0.0217
## s.e.   0.1435   0.1039   0.1048     0.0212      0.0149
## 
## sigma^2 estimated as 0.04157:  log likelihood = 8.24,  aic = -4.48
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.008391466 0.2038985 0.1594109 -4.457244 18.18523 0.6562995
##                    ACF1
## Training set 0.01573044

We have a slight improvement; let's add the others. We added them one by one, and the lowest AIC was achieved with the variables below.

reg_matrix = cbind(leopar_reduced$stock_out, leopar_reduced$basket_count_lag, leopar_reduced$visit_count_lag) # can add more if any other regressors exist
modelwithreg <- arima(leopar_reduced$random, order = c(0,0,3), xreg = reg_matrix)
summary(modelwithreg)
## 
## Call:
## arima(x = leopar_reduced$random, order = c(0, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
##           ma1      ma2      ma3  intercept  reg_matrix1  reg_matrix2
##       -0.1303  -0.4326  -0.4371     1.2916      -0.1557       -6e-04
## s.e.   0.1469   0.1254   0.0993     0.2215       0.1044        7e-04
##       reg_matrix3
##                 0
## s.e.            0
## 
## sigma^2 estimated as 0.03816:  log likelihood = 10.76,  aic = -5.51
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.004137544 0.1953563 0.1611623 -4.395967 18.36625 0.6635101
##                    ACF1
## Training set 0.01891687
model_fitted <- leopar_reduced$random - residuals(modelwithreg)
leopar_reduced <- cbind(leopar_reduced, data_mult$seasonal, data_mult$trend, model_fitted)
leopar_reduced <-leopar_reduced[,predictwithreg := data_mult$seasonal * data_mult$trend * model_fitted] 

model_fitted_2 <- leopar_reduced$random - residuals(modelf)
leopar_reduced <- cbind(leopar_reduced, model_fitted_2)
leopar_reduced <-leopar_reduced[,predictonlyarima := data_mult$seasonal * data_mult$trend * model_fitted_2] 


ggplot(leopar_reduced, aes(x=event_date)) + 
  geom_line(aes(y = sold_count, color = "Actual Sales" )) +
  geom_line(aes(y =predictwithreg, color = "ARIMAX Prediction" ))  +
   geom_line(aes(y = predictonlyarima, color = "ARIMA Prediction")) +
    geom_line(aes(y = predictedlinear, color = "Linear Prediction")) + ggtitle("Sales of Leopard Bikini") + xlab("day") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

Except for the peak points, the models seem to capture the trend well; as the plot shows, both yield quite similar results.

Let's compare all models' performance.

We need to separate the test and training data and make rolling two-day-ahead predictions over the test period. To compare the models, we evaluate both the ARIMA with regressors and the standalone ARIMA.

Arima with regressor:

train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-11')
test_end=as.Date('2021-06-30')

test_dates=seq(test_start,test_end,by='day')

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- leopar_reduced[event_date<=current_date,]
    
    leopar_ts <- ts(past_data$sold_count, frequency = 7)  
    leopar_decomposed <- decompose(leopar_ts, type="multiplicative")
    
    reg_matrix=cbind( past_data$stock_out, past_data$basket_count_lag, past_data$visit_count_lag) 
    
    model <- arima(leopar_decomposed$random,order = c(0,0,3),xreg = reg_matrix)
    
    forecasted=predict(model,n.ahead = 2,newxreg = leopar_reduced[event_date==test_dates[i],cbind(stock_out,basket_count_lag,visit_count_lag)])
    leopar_reduced[nrow(leopar_reduced)-length(test_dates)+i-2, Model_reg := forecasted$pred[2]*leopar_decomposed$seasonal[(nrow(leopar_reduced)-length(test_dates)+i-2)%%7+7]*leopar_decomposed$trend[max(which(!is.na(leopar_decomposed$trend)))]]
  }
  m_with_reg<-accu(leopar_reduced$sold_count[(nrow(leopar_reduced)-1-length(test_dates)):(nrow(leopar_reduced)-2)],(leopar_reduced$Model_reg[(nrow(leopar_reduced)-1-length(test_dates)):(nrow(leopar_reduced)-2)]))  

Arima without regressor:

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- leopar_reduced[event_date<=current_date,]
    
    leopar_ts <- ts(past_data$sold_count, frequency = 7)  
    leopar_decomposed <- decompose(leopar_ts, type="multiplicative")
    model <- arima(leopar_decomposed$random,order = c(0,0,3))
    
    forecasted=predict(model,n.ahead = 2)
    leopar_reduced[nrow(leopar_reduced)-length(test_dates)+i-2, Model_nolag := forecasted$pred[2]*leopar_decomposed$seasonal[(nrow(leopar_reduced)-length(test_dates)+i-2)%%7+7]*leopar_decomposed$trend[max(which(!is.na(leopar_decomposed$trend)))]]
  }
  m_only_arima<-accu(leopar_reduced$sold_count[(nrow(leopar_reduced)-1-length(test_dates)):(nrow(leopar_reduced)-2)],(leopar_reduced$Model_nolag[(nrow(leopar_reduced)-1-length(test_dates)):(nrow(leopar_reduced)-2)]))  
rbind(m_with_reg,m_only_arima, m_with_regression)
##    n  mean       sd        CV       FBias      MAPE     RMSE      MAD      MADP
## 1 20 38.85 17.37595 0.4472574 -0.03222574 0.4708705 17.90544 13.60894 0.3502944
## 2 20 38.85 17.37595 0.4472574 -0.10978623 0.4468880 16.79532 12.12696 0.3121482
## 3 20 38.85 17.37595 0.4472574 -0.21076893 0.5792228 17.89721 15.22387 0.3918627
##       WMAPE
## 1 0.3502944
## 2 0.3121482
## 3 0.3918627

Although the ARIMAX model seems to capture the in-sample trend better, when tested on the held-out period the decomposition-based ARIMA models outperform the regression + ARIMA approach. Comparing the two ARIMA variants with each other, the one without regressors performs slightly better, though the difference is not significant.

In the final predictions, a combination of these models will be used.
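A combination of this kind can be sketched as a weighted average of the model forecasts; the weights below are purely illustrative, since in practice we chose them from the test-period accuracy and our website observations:

```r
# Hypothetical weighted average of two model forecasts
combine_forecasts <- function(pred_a, pred_b, w_a = 0.5) {
  w_a * pred_a + (1 - w_a) * pred_b
}

# e.g. weighting the standalone ARIMA (pred_a) a bit more, as it tested slightly better
final_pred <- combine_forecasts(pred_a = 40, pred_b = 36, w_a = 0.6)
```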

3.9 Black Bikini

The black bikini is the next product. Given the nature of swimsuits, we expect them to sell more as summer approaches; during the winter, people do not tend to buy bikinis as much. To understand this product's structure, we examine Trendyol's Website and the positioning of this specific bikini top.

Although its positioning has changed during the period, it is a semi-popular model on Trendyol, where there are many black bikini tops. We also examine the product page and check for campaigns regularly. In our first observation, we noticed that the larger sizes of this model were out of stock.

Reading the data:

data_siyah <- raw_data[product_content_id == 32737302] 

To understand the trend, the time series graph of the number of sold items should be examined.

ggplot(data_siyah, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + ggtitle("Sales of Black Bikini") + xlab("day") + ylab("Sales")

There are lots of missing points in the data. As we don't have a full year of data, we cannot check for a monthly trend. We will check for a weekly pattern and examine the autocorrelation plot.

acf(data_siyah$sold_count, main= "Daily Autocorrelation")

siyah <- data_siyah

The data tells us there is a trend; however, no specific lag stands out. We do not omit N/A's in order not to lose any information.

Although we did not see a peak at lag seven, a weekly pattern is plausible, so we will decompose with a 7-day period. As the variance changes over time, we will first make a multiplicative decomposition.

The 0’s can create complexity for the model; that’s why we trim the series, as we did for the Leopard Bikini.

siyah <- siyah[event_date >= '2021-02-20'] 

Plotting the trimmed version:

ggplot(siyah, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + ggtitle("Sales of Black Bikini") + xlab("day") + ylab("Sales")

As there is an obvious change of variance in the graph, we will carry out our analysis on the log-transformed series.

acf(log(siyah$sold_count), main= "Daily Autocorrelation")

pacf(log(siyah$sold_count), main= "Daily Autocorrelation")

As can be seen from the autocorrelation plot, the maximum is at lag 10. The partial autocorrelation at lag 7 is above the significance limit. A weekly pattern makes sense, so we start our analysis with that.

We will try both multiplicative and additive decompositions, then choose the better-fitting one for the model.

siyahts <- ts(log(siyah$sold_count),freq=7)
data_mult<-decompose(siyahts,type="multiplicative")
random=data_mult$random
plot(data_mult)

The trend increases as summer arrives, and the trend component catches it. At a glance, the random component looks stationary.

In this graph it is hard to identify the daily seasonal pattern, so we need a closer look.

plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplier of Day", main = "Seasonal Component for Multiplicative Decomposition")

To understand the multiplicative decomposition, we examine the day multipliers. To compare the multiplicative behaviour with the data, we also examine the mean sales of each day. Note that although the mean is very high on Wednesdays, the Wednesday multiplier is low.

Having higher sales early in the week is normal, as people want delivery before the end of the week.

mean_sold=siyah[,list(mean_sold = mean(log(sold_count),na.rm=T)), by="wday"]
mean_sold
##    wday mean_sold
## 1:  Sat  3.322176
## 2:  Sun  3.534468
## 3:  Mon  3.470483
## 4:  Tue  3.615199
## 5:  Wed  3.646734
## 6:  Thu  3.580097
## 7:  Fri  3.527455

To check the stationarity of the detrended part, we use the KPSS test:

unt_test=ur.kpss(data_mult$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0266 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic 0.0266 is below even the 10% critical value of 0.347, so the corresponding p-value is larger than 10% and we fail to reject the null hypothesis: the data can be treated as stationary.
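This comparison can also be read programmatically from the object returned by `ur.kpss` (urca package), whose S4 slots hold the test statistic and critical values; a small sketch (`is_stationary_kpss` is a hypothetical helper):

```r
library(urca)  # provides ur.kpss()

# Sketch: decide stationarity by comparing the KPSS statistic to a
# chosen critical value. ur.kpss() returns an S4 object whose
# @teststat slot holds the statistic and @cval holds the critical
# values (columns: 10pct, 5pct, 2.5pct, 1pct).
is_stationary_kpss <- function(x, level = "10pct") {
  test <- ur.kpss(x)
  as.numeric(test@teststat) < test@cval[1, level]
}

# e.g. is_stationary_kpss(data_mult$random) should return TRUE here,
# since 0.0266 < 0.347.
```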

As the multipliers of the multiplicative model were not satisfying, we will also check the additive decomposition.

siyahts <- ts(log(siyah$sold_count),freq=7)
data_add<-decompose(siyahts,type="additive")
random=data_add$random
plot(data_add)

The trend is very similar to that of the multiplicative version.

In this graph it is hard to identify the daily seasonal pattern, so we need a closer look.

plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive Effect of Day", main = "Seasonal Component for Additive Decomposition")

It yields very similar results to the multiplicative model.

unt_test=ur.kpss(data_add$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0356 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic 0.0356 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis: the data can be treated as stationary.

The results are pretty similar; that’s why we choose to continue with the additive model.

acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

pacf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

The autocorrelation plot is decreasing, and there is a significant partial autocorrelation at lag 6. That’s why AR(6) will be tried first.

model <- arima(random, order=c(6,0,0))
print(model)
## 
## Call:
## arima(x = random, order = c(6, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3      ar4      ar5      ar6  intercept
##       0.0734  -0.3398  -0.3592  -0.0689  -0.2242  -0.2432     0.0047
## s.e.  0.0861   0.0841   0.0892   0.0887   0.0834   0.0872     0.0107
## 
## sigma^2 estimated as 0.06485:  log likelihood = -7,  aic = 29.99

The partial autocorrelation plot is also decreasing, and the largest autocorrelation is at lag 3. Let’s also try MA(3).

modelf <- arima(random, order=c(0,0,3))
print(modelf)
## 
## Call:
## arima(x = random, order = c(0, 0, 3))
## 
## Coefficients:
##           ma1      ma2      ma3  intercept
##       -0.0529  -0.4762  -0.4708     0.0007
## s.e.   0.0817   0.0699   0.0857     0.0014
## 
## sigma^2 estimated as 0.05785:  log likelihood = -1.31,  aic = 12.61

As its AIC is lower than the first model’s, this model is better.

We now combine the two and try the neighbouring orders to decide which model to use.

model <- arima(random, order=c(6,0,3))
print(model)
## 
## Call:
## arima(x = random, order = c(6, 0, 3))
## 
## Coefficients:
##          ar1     ar2      ar3     ar4      ar5      ar6      ma1      ma2
##       0.4007  0.2178  -0.6260  0.2164  -0.1480  -0.2637  -0.6526  -0.7734
## s.e.  0.3719  0.3844   0.2814  0.1374   0.0979   0.1355   0.3870   0.4448
##          ma3  intercept
##       0.4260      0e+00
## s.e.  0.3111      5e-04
## 
## sigma^2 estimated as 0.04885:  log likelihood = 7.73,  aic = 6.53

The model improved; we will continue with the neighbors.

modelf <- arima(random, order=c(6,0,4))
## Warning in arima(random, order = c(6, 0, 4)): possible convergence problem:
## optim gave code = 1
print(modelf)
## 
## Call:
## arima(x = random, order = c(6, 0, 4))
## 
## Coefficients:
##           ar1     ar2      ar3      ar4     ar5      ar6     ma1      ma2
##       -0.2177  0.8111  -0.2920  -0.3718  0.1346  -0.2844  0.0488  -1.6487
## s.e.   0.1298  0.1493   0.1129   0.1107  0.1284   0.0995  0.1424   0.1254
##           ma3     ma4  intercept
##       -0.2440  0.8441     -2e-04
## s.e.   0.1315  0.1375      3e-04
## 
## sigma^2 estimated as 0.04486:  log likelihood = 10.47,  aic = 3.05

It has improved again, so we will choose (6,0,4) for the ARIMA models.
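The manual neighbour search above can also be automated by fitting each candidate order and keeping the one with the lowest AIC; a sketch with an illustrative grid (`best_arma_by_aic` is a hypothetical helper, not the report's code):

```r
# Sketch: pick the (p,0,q) order with the lowest AIC among a small
# candidate grid around the orders suggested by the ACF/PACF plots.
# Fits that fail to converge are skipped via tryCatch.
best_arma_by_aic <- function(x, p_range = 4:7, q_range = 2:5) {
  best <- list(aic = Inf, order = NULL)
  for (p in p_range) {
    for (q in q_range) {
      fit <- tryCatch(arima(x, order = c(p, 0, q)), error = function(e) NULL)
      if (!is.null(fit) && AIC(fit) < best$aic) {
        best <- list(aic = AIC(fit), order = c(p, 0, q))
      }
    }
  }
  best
}

# Usage: best_arma_by_aic(random)
```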

To identify the possible regressors for the ARIMA model, the correlations with both the detrended data and sold_count are important.

Then, the lagged variables will be added to the model, since at prediction time we need the data to be available.

siyah <- siyah[, random := data_add$random]
numeric_data <- siyah
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   126 obs. of  12 variables:
##  $ price              : num  60 60 60 60 60 ...
##  $ sold_count         : num  18 34 30 39 49 81 66 50 12 28 ...
##  $ visit_count        : num  1849 2884 4900 6435 8192 ...
##  $ basket_count       : num  139 186 277 338 429 601 428 249 191 251 ...
##  $ favored_count      : num  332 560 1097 1366 1902 ...
##  $ category_sold      : num  452 672 689 1184 1984 ...
##  $ category_visits    : num  605 837 855 1359 2242 ...
##  $ category_basket    : num  180869 230145 262976 496262 833757 ...
##  $ category_favored   : num  6054 8207 7876 13106 20913 ...
##  $ category_brand_sold: num  23915 30570 40666 86998 149519 ...
##  $ ty_visits          : num  9.51e+07 1.01e+08 9.37e+07 8.79e+07 1.14e+08 ...
##  $ random             : num  -0.0665 0.324 -0.0746 -0.0146 0.2249 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The variables most correlated with the data are basket_count, visit_count, and category_sold. To be able to use them in our prediction, we add their lagged versions.

Price does not show a true correlation, as the data does not reflect every discount type.

siyah <- siyah[,category_sold_lag := shift(category_sold, 2)]
siyah <- siyah[,basket_count_lag := shift(basket_count, 2)]
siyah <- siyah[,visit_count_lag := shift(visit_count, 2)]

numeric_data <- siyah
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
## Warning in set(x, j = name, value = value): Column 'trend' does not exist to
## remove
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   126 obs. of  15 variables:
##  $ price              : num  60 60 60 60 60 ...
##  $ sold_count         : num  18 34 30 39 49 81 66 50 12 28 ...
##  $ visit_count        : num  1849 2884 4900 6435 8192 ...
##  $ basket_count       : num  139 186 277 338 429 601 428 249 191 251 ...
##  $ favored_count      : num  332 560 1097 1366 1902 ...
##  $ category_sold      : num  452 672 689 1184 1984 ...
##  $ category_visits    : num  605 837 855 1359 2242 ...
##  $ category_basket    : num  180869 230145 262976 496262 833757 ...
##  $ category_favored   : num  6054 8207 7876 13106 20913 ...
##  $ category_brand_sold: num  23915 30570 40666 86998 149519 ...
##  $ ty_visits          : num  9.51e+07 1.01e+08 9.37e+07 8.79e+07 1.14e+08 ...
##  $ random             : num  -0.0665 0.324 -0.0746 -0.0146 0.2249 ...
##  $ category_sold_lag  : num  645 431 452 672 689 ...
##  $ basket_count_lag   : num  63 101 139 186 277 338 429 601 428 249 ...
##  $ visit_count_lag    : num  909 1336 1849 2884 4900 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The correlations of the lagged variables are all very close to each other. Let’s add them one by one to see the improvement.

reg_matrix=cbind( siyah$basket_count_lag)
   model1 <- arima(siyah$random,order = c(6,0,4), xreg = reg_matrix)
  summary(model1)
## 
## Call:
## arima(x = siyah$random, order = c(6, 0, 4), xreg = reg_matrix)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ar3      ar4      ar5      ar6      ma1      ma2
##       0.4948  0.0127  -0.0041  -0.2236  -0.0121  -0.1573  -0.8720  -0.6105
## s.e.  0.3606  0.5192   0.5245   0.3567   0.1698   0.1476   0.3448   0.6227
##           ma3     ma4  intercept  reg_matrix
##       -0.1578  0.6426    -0.0062           0
## s.e.   0.5793  0.3169     0.0001         NaN
## 
## sigma^2 estimated as 0.04246:  log likelihood = 13.13,  aic = -0.27
## 
## Training set error measures:
##                     ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0298191 0.2060568 0.1616004 -212.3694 424.6147 0.5381477
##                      ACF1
## Training set -0.008966352

The AIC improved when we added the lagged variable.

Adding the second:

reg_matrix=cbind( siyah$visit_count_lag, siyah$basket_count_lag) 

   model2 <- arima(siyah$random,order = c(6,0,4), xreg = reg_matrix)
  summary(model2)
## 
## Call:
## arima(x = siyah$random, order = c(6, 0, 4), xreg = reg_matrix)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2     ar3     ar4     ar5      ar6     ma1      ma2     ma3
##       0.3191  0.3316  -0.311  -0.103  0.0061  -0.2218  -0.724  -0.9922  0.1621
## s.e.     NaN     NaN     NaN   0.206  0.1065   0.0457     NaN      NaN     NaN
##          ma4  intercept  reg_matrix1  reg_matrix2
##       0.5567    -0.0082            0        1e-04
## s.e.  0.2328     0.0001            0        1e-04
## 
## sigma^2 estimated as 0.04181:  log likelihood = 14.01,  aic = -0.02
## 
## Training set error measures:
##                      ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.02551946 0.2044816 0.1584082 -162.5614 388.7773 0.5275174
##                     ACF1
## Training set 0.003726793

The AIC did not improve.

model_fitted <- siyah$random - residuals(model1)
siyah <- cbind(siyah, data_add$seasonal, data_add$trend, model_fitted)
siyah <-siyah[,predictarima := exp(data_add$seasonal + data_add$trend + model_fitted) ] 

model_fitted_2 <- siyah$random - residuals(modelf)
siyah <- cbind(siyah, model_fitted_2)
siyah <-siyah[,predictonlyar := exp(data_add$seasonal + data_add$trend + model_fitted_2)] 

ggplot(siyah, aes(x=event_date)) + 
  geom_line(aes(y = sold_count, color = "Actual Sales")) + 
  geom_line(aes(y = predictarima, color="ARIMAX Prediction")) +
  geom_line(aes(y = predictonlyar, color="ARIMA Prediction")) + ggtitle("Sales of Black Bikini") + xlab("day") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

Adding further regressors increases the AIC, thus lowering performance. The model with the basket count lag has the lowest AIC, so we will make predictions with it and with plain ARIMA and compare the results.

Testing with regressor:

train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-24')
test_end=as.Date('2021-06-30')

test_dates=seq(test_start,test_end,by='day')

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- siyah[event_date<=current_date,]
    
    siyah_ts <- ts(log(past_data$sold_count), frequency = 7)  
    siyah_decomposed <- decompose(siyah_ts, type="additive")
    model <- arima(siyah_decomposed$random,order = c(6,0,4),xreg = past_data$basket_count_lag)
    
    forecasted=predict(model,n.ahead = 2,newxreg = siyah[event_date==test_dates[i],basket_count_lag])
    siyah[nrow(siyah)-length(test_dates)+i-2, Model_reg :=  forecasted$pred[2]+siyah_decomposed$seasonal[(nrow(siyah)-length(test_dates)+i-2)%%7+7]+siyah_decomposed$trend[max(which(!is.na(siyah_decomposed$trend)))]]
}
siyah$Model_reg <- exp(siyah$Model_reg)
  m_with_lag<-accu(siyah$sold_count[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)],(siyah$Model_reg[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)]))  

Testing with only arima:

for(i in 1:length(test_dates)){
    
    current_date=test_dates[i]-2
    
    past_data <- siyah[event_date<=current_date,]
    
    siyah_ts <- ts(log(past_data$sold_count), frequency = 7)  
    siyah_decomposed <- decompose(siyah_ts, type="additive")
    model <- arima(siyah_decomposed$random,order = c(6,0,4))
    
    forecasted=predict(model,n.ahead = 2)
    siyah[nrow(siyah)-length(test_dates)+i-2, Model_no_reg := forecasted$pred[2]+siyah_decomposed$seasonal[(nrow(siyah)-length(test_dates)+i-2)%%7+7]+siyah_decomposed$trend[max(which(!is.na(siyah_decomposed$trend)))]]
}
siyah$Model_no_reg <- exp(siyah$Model_no_reg)
  m_without_lag<-accu(siyah$sold_count[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)],(siyah$Model_no_reg[(nrow(siyah)-1-length(test_dates)):(nrow(siyah)-2)]))  
rbind( m_with_lag, m_without_lag )
##   n mean       sd        CV      FBias      MAPE     RMSE      MAD      MADP
## 1 7   54 15.05545 0.2788047 -0.1739556 0.2489667 14.30987 12.74477 0.2360143
## 2 7   54 15.05545 0.2788047 -0.2539448 0.3446689 19.47286 17.42185 0.3226269
##       WMAPE
## 1 0.2360143
## 2 0.3226269

The model performed much better when the lagged regressor was included.
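The comparison tables rely on the custom `accu()` helper defined earlier in the report; for reference, its two headline metrics, MAPE and WMAPE, can be sketched as follows (assuming complete, strictly positive actuals):

```r
# Sketch of the two headline accuracy metrics from the tables.
# MAPE averages per-observation percentage errors; WMAPE weights
# errors by actual volume, so high-sales days count for more.
mape  <- function(actual, forecast) mean(abs(actual - forecast) / actual)
wmape <- function(actual, forecast) sum(abs(actual - forecast)) / sum(actual)

# Example:
# mape(c(50, 60), c(45, 66))   # mean(5/50, 6/60) = 0.10
# wmape(c(50, 60), c(45, 66))  # 11/110 = 0.10
```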

3.10 Leggings

The next product is the leggings, which have a large portfolio on Trendyol; ours is one of the popular models. Leggings can be worn in any season, but we can expect more sales when the weather is not too cold or too hot, during spring or autumn.

The product depends heavily on price; however, as the price data here does not include every discount, the relationship is not clear in the data.

data_tayt <- raw_data[product_content_id == 31515569]

To have a better understanding, we plot the time series of the legging sales

ggplot(data_tayt, aes(x=event_date)) + 
  geom_line(aes(y = sold_count), color = "red") + ggtitle("Sales of Leggings") + xlab("day") + ylab("Sales")

We will check for a weekly pattern; we do not have enough data to check for longer seasonalities. We expect a 7-day frequency, and to see the seasonality we will inspect the autocorrelation.

acf(data_tayt$sold_count, main= "Daily Autocorrelation")

pacf(data_tayt$sold_count, main= "Daily Autocorrelation")

We have a peak at lag 16, so we will try it in the decomposition. We will also check a 7-day decomposition and compare the two according to their stationarity results.

In order not to lose information, we do not delete N/A’s (the days with 0 sales).

Although we did not see a peak at lag seven, a weekly pattern is plausible, so we will decompose with a 7-day period. As the variance changes over time, we will make a multiplicative decomposition. Since there is a peak at lag 16, we will also try a decomposition with a 16-day period.

tayt <- data_tayt
taytts <- ts(tayt$sold_count,freq=7)
data_mult<-decompose(taytts,type="multiplicative")
random=data_mult$random
plot(data_mult)

There are peak sales in November: leggings may be preferred as autumn wear, and the peak may also reflect the low prices at that time. The trend part of the decomposition seems to catch the overall trend.

When we examine the detrend part, we see a stationary data with a constant mean and variance.

In this graph it is hard to identify the daily seasonal pattern, so we need a closer look.

plot(data_mult$seasonal[1:7], xlab = "Day", ylab = "Multiplier of Day", main = "Seasonal Component for Multiplicative Decomposition")

To compare the seasonality multipliers with the mean sales of each day, we examine the means:

mean_sold=tayt[,list(mean_sold = mean(sold_count,na.rm=T)), by="wday"]
mean_sold
##    wday mean_sold
## 1:  Mon  811.1379
## 2:  Tue  940.0000
## 3:  Wed 1034.5345
## 4:  Thu 1039.1897
## 5:  Fri  696.5789
## 6:  Sat  580.2807
## 7:  Sun  608.4035

The seasonality seems to reflect the daily mean sales.

To check the stationarity of the detrended part, we use the KPSS test:

unt_test=ur.kpss(data_mult$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1184 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic 0.1184 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis: the data can be treated as stationary.

We will also check the additive decomposition to see whether it performs better.

taytts <- ts(tayt$sold_count,freq=7)
data_add<-decompose(taytts,type="additive")
random=data_add$random
plot(data_add)

The trend part of the decomposition seems to catch the overall trend well.

When we examine the detrended part, we see stationary data with a constant mean and variance. Compared to the multiplicative version, the mean is smaller and the outlier peak can be identified in the detrended part.

In this graph it is hard to identify the daily seasonal pattern, so we need a closer look.

plot(data_add$seasonal[1:7], xlab = "Day", ylab = "Additive Effect of Day", main = "Seasonal Component for Additive Decomposition")

The seasonality seems to reflect the mean values of the days and is pretty similar to the multiplicative decomposition results.

To check the stationarity of the detrended part, we use the KPSS test:

unt_test=ur.kpss(data_add$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0063 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic 0.0063 is below even the 10% critical value of 0.347, so we fail to reject the null hypothesis: the data can be treated as stationary. This statistic is also smaller than the multiplicative decomposition’s; that’s why we will continue with the additive decomposition.

As the autocorrelation plot gives higher values at lag 16, we will examine that period as well. We will use additive decomposition, as it performed better for the 7-day decomposition.

taytts <- ts(tayt$sold_count,freq=16)
data_add_2<-decompose(taytts,type="additive")
plot(data_add_2)

During the peak period, the trend seems to catch the decrease later. The random part seems similar to the other model’s.

In this graph it is hard to identify the daily seasonal pattern, so we need a closer look.

plot(data_add_2$seasonal[1:16], xlab = "Day", ylab = "Additive Effect of Day", main = "Seasonal Component for Additive Decomposition (16-Day Period)")

The seasonality peaks on the second week’s Tuesdays and Wednesdays; however, with a 16-day period we lose track of the days of the week in this decomposition.

To check the stationarity of the detrended part, we use the KPSS test:

unt_test=ur.kpss(data_add_2$random) 
summary(unt_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0056 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The random part is also stationary; however, we will continue our analysis with the 7-day decomposition.
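Beyond the KPSS statistics, one simple numeric check when choosing between the two frequencies is the spread of the random components; a sketch:

```r
# Sketch: compare the residual spread of the 7-day and 16-day
# additive decompositions. A smaller residual standard deviation
# means the trend + seasonal components explain more of the series.
sd(data_add$random,   na.rm = TRUE)  # freq = 7
sd(data_add_2$random, na.rm = TRUE)  # freq = 16
```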

3.10.1 Arima and Arimax Models

acf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

pacf(random, na.action = na.pass, main= "Detrend's Autocorrelation")

The autocorrelation plot is decreasing, and there is a significant partial autocorrelation at lag 4. That’s why AR(4) will be tried first.

model <- arima(random, order=c(4,0,0))
print(model)
## 
## Call:
## arima(x = random, order = c(4, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3      ar4  intercept
##       0.2668  -0.1144  -0.1979  -0.2597    -0.0114
## s.e.  0.0485   0.0494   0.0493   0.0484    18.6062
## 
## sigma^2 estimated as 232696:  log likelihood = -3016.65,  aic = 6045.3

The partial autocorrelation plot is also decreasing, and the significant autocorrelation is at lag 4. Let’s also try MA(4).

model <- arima(random, order=c(0,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(0, 0, 4))
## 
## Coefficients:
##          ma1      ma2      ma3      ma4  intercept
##       0.0451  -0.3005  -0.4005  -0.3441     0.0780
## s.e.  0.0461   0.0457   0.0431   0.0512     0.6099
## 
## sigma^2 estimated as 202322:  log likelihood = -2991.02,  aic = 5994.03

As its AIC is lower than the first model’s, this model is better.

We now combine the two and try the neighbouring orders to decide which model to use.

model <- arima(random, order=c(4,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(4, 0, 4))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ar3     ar4     ma1      ma2      ma3     ma4
##       0.2376  0.1942  -0.0391  -0.401  -0.244  -0.5604  -0.3554  0.1598
## s.e.     NaN  0.5659      NaN     NaN     NaN   0.5899      NaN     NaN
##       intercept
##          0.0598
## s.e.     0.3341
## 
## sigma^2 estimated as 183334:  log likelihood = -2972.15,  aic = 5964.3

This is the lowest AIC, thus the best model. Trying the neighbors:

model <- arima(random, order=c(3,0,4))
print(model)
## 
## Call:
## arima(x = random, order = c(3, 0, 4))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ar3      ma1      ma2     ma3     ma4  intercept
##       0.6368  0.6023  -0.6433  -0.7536  -1.0818  0.5753  0.2601     0.0704
## s.e.     NaN     NaN      NaN      NaN      NaN     NaN  0.0237     0.0767
## 
## sigma^2 estimated as 171534:  log likelihood = -2960.91,  aic = 5939.82

Using AR(3) increased the performance. Checking for MA(3):

modelf <- arima(random, order=c(4,0,3))
print(modelf)
## 
## Call:
## arima(x = random, order = c(4, 0, 3))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ar3     ar4      ma1      ma2     ma3  intercept
##       0.9752  0.3597  -0.8759  0.2881  -1.0612  -0.7895  0.8508      0.074
## s.e.     NaN     NaN      NaN  0.0329      NaN      NaN     NaN      0.073
## 
## sigma^2 estimated as 168325:  log likelihood = -2957.1,  aic = 5932.2

This is the lowest AIC, so we will continue with (4,0,3).

To identify the possible regressors for the ARIMA model, the correlations with both the detrended data and sold_count are important.

Then, the lagged variables will be added to the model, since at prediction time we need the data to be available.

tayt <- tayt[, random := data_add$random]
numeric_data <- tayt
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   397 obs. of  12 variables:
##  $ price              : num  45 45 45 45 45 ...
##  $ sold_count         : num  366 1188 1162 895 649 ...
##  $ visit_count        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ basket_count       : num  2293 6379 5640 4200 3084 ...
##  $ favored_count      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_sold      : num  836 1581 1530 1353 868 ...
##  $ category_visits    : num  7436 7736 9176 8896 7154 ...
##  $ category_basket    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_favored   : num  35389 36277 42223 38858 30893 ...
##  $ category_brand_sold: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ty_visits          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ random             : num  -563 604 665 316 -162 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

The variables most correlated with the data are basket_count and category_sold. To use them in the regression, we add their lagged versions to the model.

tayt <- tayt[,category_sold_lag := shift(category_sold, 2)]
tayt <- tayt[,basket_count_lag := shift(basket_count, 2)]
tayt <- tayt[,favored_count_lag := shift(favored_count, 2)]
tayt <- tayt[,price_lag := shift(price, 2)]

numeric_data <- tayt
numeric_data$event_date = NULL
numeric_data$product_content_id = NULL
numeric_data$trend= NULL
## Warning in set(x, j = name, value = value): Column 'trend' does not exist to
## remove
numeric_data$month= NULL
numeric_data$wday= NULL
numeric_data$random= as.numeric(numeric_data$random)
numeric_data <- na.omit(numeric_data)
str(numeric_data)
## Classes 'data.table' and 'data.frame':   397 obs. of  16 variables:
##  $ price              : num  45 45 45 45 45 ...
##  $ sold_count         : num  366 1188 1162 895 649 ...
##  $ visit_count        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ basket_count       : num  2293 6379 5640 4200 3084 ...
##  $ favored_count      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_sold      : num  836 1581 1530 1353 868 ...
##  $ category_visits    : num  7436 7736 9176 8896 7154 ...
##  $ category_basket    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_favored   : num  35389 36277 42223 38858 30893 ...
##  $ category_brand_sold: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ty_visits          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ random             : num  -563 604 665 316 -162 ...
##  $ category_sold_lag  : num  953 777 836 1581 1530 ...
##  $ basket_count_lag   : num  2226 1754 2293 6379 5640 ...
##  $ favored_count_lag  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ price_lag          : num  45 45 45 45 45 ...
##  - attr(*, ".internal.selfref")=<externalptr>
correl_info = cor(numeric_data)
ggcorrplot(correl_info, hc.order = TRUE, type = "lower",lab = TRUE)

basket_count_lag has the highest correlation; that’s why we start with it.

reg_matrix=cbind( tayt$basket_count_lag) # can add more if any other regressors exist
   model1 <- arima(tayt$random,order = c(4,0,3), xreg = reg_matrix)
  summary(model)
## 
## Call:
## arima(x = random, order = c(3, 0, 4))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ar3      ma1      ma2     ma3     ma4  intercept
##       0.6368  0.6023  -0.6433  -0.7536  -1.0818  0.5753  0.2601     0.0704
## s.e.     NaN     NaN      NaN      NaN      NaN     NaN  0.0237     0.0767
## 
## sigma^2 estimated as 171534:  log likelihood = -2960.91,  aic = 5939.82
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE    MAPE      MASE       ACF1
## Training set -4.901652 414.1668 227.7179 97.47823 154.436 0.7217732 0.02486137

The AIC remained about the same; we can say the model neither improved nor worsened.

We add the other potential regressor to the model.

reg_matrix=cbind(tayt$category_sold_lag) 

   model1 <- arima(tayt$random,order = c(4,0,3), xreg = reg_matrix)
  summary(model1)
## 
## Call:
## arima(x = tayt$random, order = c(4, 0, 3), xreg = reg_matrix)
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2      ar3     ar4      ma1      ma2     ma3  intercept
##       0.9173  0.4814  -0.9728  0.3166  -1.0111  -0.9191  0.9305     -0.844
## s.e.     NaN     NaN      NaN     NaN      NaN      NaN     NaN        NaN
##       reg_matrix
##            4e-04
## s.e.         NaN
## 
## sigma^2 estimated as 165884:  log likelihood = -2954.36,  aic = 5928.72
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
## Training set 8.982585 407.2884 226.2653 103.9454 162.7711 0.7171692 -0.01527941

The AIC is lower, so we can use the lagged category_sold variable.


To plot the ARIMA fits with and without the lagged regressor on the same graph:

model_fitted <- tayt$random - residuals(model1)    # fitted remainder, ARIMA with regressor
tayt <- cbind(tayt, data_add$seasonal, data_add$trend, model_fitted)
tayt <- tayt[, predictarima := data_add$seasonal + data_add$trend + model_fitted]

model_fitted_2 <- tayt$random - residuals(modelf)  # fitted remainder, ARIMA without regressor
tayt <- cbind(tayt, model_fitted_2)
tayt <- tayt[, predictarima_no_reg := data_add$seasonal + data_add$trend + model_fitted_2]

ggplot(tayt, aes(x = event_date)) +
  geom_line(aes(y = sold_count, color = "sold_count")) +
  geom_line(aes(y = predictarima, color = "arima_with_lag")) +
  geom_line(aes(y = predictarima_no_reg, color = "only_arima_prediction")) +
  ggtitle("Sales of Leggings") + xlab("day") + ylab("Sales")
## Warning: Removed 6 row(s) containing missing values (geom_path).


The two models yield very similar results, and both capture the trend except at the peak points.

To generate the rolling two-day-ahead predictions:

train_start=as.Date('2021-01-23')
test_start=as.Date('2021-06-23')
test_end=as.Date('2021-06-30')

test_dates=seq(test_start,test_end,by='day')

for (i in 1:length(test_dates)) {
  current_date <- test_dates[i] - 2
  past_data <- tayt[event_date <= current_date, ]

  tayt_ts <- ts(past_data$sold_count, frequency = 7)
  tayt_decomposed <- decompose(tayt_ts, type = "additive")
  model <- arima(tayt_decomposed$random, order = c(4, 0, 3), xreg = past_data$basket_count_lag)

  forecasted <- predict(model, n.ahead = 2, newxreg = tayt[event_date == test_dates[i], basket_count_lag])
  tayt[nrow(tayt) - length(test_dates) + i - 2,
       Model_reg := forecasted$pred[2] +
         tayt_decomposed$seasonal[(nrow(tayt) - length(test_dates) + i - 2) %% 7 + 7] +
         tayt_decomposed$trend[max(which(!is.na(tayt_decomposed$trend)))]]
}
m_with_7 <- accu(tayt$sold_count[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)],
                 tayt$Model_reg[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)])

For the model with no regressors:

for (i in 1:length(test_dates)) {
  current_date <- test_dates[i] - 2
  past_data <- tayt[event_date <= current_date, ]

  tayt_ts <- ts(past_data$sold_count, frequency = 7)
  tayt_decomposed <- decompose(tayt_ts, type = "additive")
  model <- arima(tayt_decomposed$random, order = c(4, 0, 3))

  forecasted <- predict(model, n.ahead = 2)
  tayt[nrow(tayt) - length(test_dates) + i - 2,
       Model_noreg := forecasted$pred[2] +
         tayt_decomposed$seasonal[(nrow(tayt) - length(test_dates) + i - 2) %% 7 + 7] +
         tayt_decomposed$trend[max(which(!is.na(tayt_decomposed$trend)))]]
}
m_with_no_reg <- accu(tayt$sold_count[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)],
                      tayt$Model_noreg[(nrow(tayt) - 1 - length(test_dates)):(nrow(tayt) - 2)])
rbind(m_with_7,m_with_no_reg)
##   n mean       sd        CV       FBias      MAPE     RMSE      MAD      MADP
## 1 8  257 42.77182 0.1664273 -0.19691640 0.6108043 157.2306 151.4369 0.5892486
## 2 8  257 42.77182 0.1664273 -0.09805438 0.5866824 153.3859 146.6330 0.5705563
##       WMAPE
## 1 0.5892486
## 2 0.5705563

The ARIMA model without regressors performed better: WMAPE, MAPE, and MAD are all lower for the model with no regressors.
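The `accu()` helper used above is our own summary function; the following is a minimal sketch consistent with the columns printed above (hypothetical, the actual implementation may differ):

```r
# Hypothetical sketch of the accu() accuracy helper; column names follow
# the printed output (n, mean, sd, CV, FBias, MAPE, RMSE, MAD, MADP, WMAPE),
# but the real implementation in our project may differ.
accu <- function(actual, forecast) {
  error <- actual - forecast
  data.frame(
    n     = length(actual),
    mean  = mean(actual),
    sd    = sd(actual),
    CV    = sd(actual) / mean(actual),          # coefficient of variation
    FBias = sum(error) / sum(actual),           # overall forecast bias
    MAPE  = mean(abs(error / actual)),          # mean absolute percentage error
    RMSE  = sqrt(mean(error^2)),
    MAD   = mean(abs(error)),                   # mean absolute deviation
    MADP  = sum(abs(error)) / sum(abs(actual)),
    WMAPE = sum(abs(error)) / sum(actual)       # weighted MAPE
  )
}
```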

3.10.2 Manual Price Entry Approach

As another approach for the leggings, we built a model that lets us enter the current price manually. It is hard to compare this methodology with the methods above, since manual entry gives us the opportunity to account for the discounts applied in the basket.

We first remove the rows with missing values:

tayt_reduced <- na.omit(data_tayt)

We add related variables one by one, stopping when the adjusted R-squared no longer improves:

model = lm(sold_count~ basket_count_lag + price, tayt_reduced)
summary(model)
## 
## Call:
## lm(formula = sold_count ~ basket_count_lag + price, data = tayt_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3108.1  -336.7  -125.9   161.3  9320.3 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1796.36604  381.67298   4.707 3.49e-06 ***
## basket_count_lag    0.14080    0.01253  11.236  < 2e-16 ***
## price             -30.77352    7.31571  -4.206 3.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 896.5 on 394 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3024 
## F-statistic: 86.81 on 2 and 394 DF,  p-value: < 2.2e-16
model = lm(sold_count~ basket_count_lag + favored_count_lag + price, tayt_reduced)
summary(model)
## 
## Call:
## lm(formula = sold_count ~ basket_count_lag + favored_count_lag + 
##     price, data = tayt_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3016.7  -336.5  -126.8   166.8  9347.2 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1878.67363  412.46407   4.555 7.01e-06 ***
## basket_count_lag     0.13722    0.01425   9.631  < 2e-16 ***
## favored_count_lag    0.01740    0.03287   0.529    0.597    
## price              -32.55286    8.05734  -4.040 6.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 897.3 on 393 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3011 
## F-statistic: 57.86 on 3 and 393 DF,  p-value: < 2.2e-16

The adjusted R-squared did not improve here, so we exclude favored_count_lag.

model1 = lm(sold_count~ basket_count_lag + category_sold_lag + price, tayt_reduced)
summary(model1)
## 
## Call:
## lm(formula = sold_count ~ basket_count_lag + category_sold_lag + 
##     price, data = tayt_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3124.3  -339.5  -142.0   194.5  9420.2 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2164.23883  405.18625   5.341 1.57e-07 ***
## basket_count_lag     0.11650    0.01564   7.451 5.95e-13 ***
## category_sold_lag    0.06252    0.02436   2.567   0.0106 *  
## price              -38.96236    7.93400  -4.911 1.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 890.2 on 393 degrees of freedom
## Multiple R-squared:  0.3173, Adjusted R-squared:  0.3121 
## F-statistic: 60.89 on 3 and 393 DF,  p-value: < 2.2e-16

The adjusted R-squared improved, so this is the final model.

Checking residuals:

plot(model1$residuals)

checkresiduals(model1$residuals) 
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

The residuals seem normally distributed with a mean around zero, and they look random except for the outlier peak.

We model the residuals with an ARIMA to make the prediction:

modellastts <- ts(model1$residuals)

modelarima1 <- arima(modellastts, order=c(4,0,1))
print(modelarima1)
## 
## Call:
## arima(x = modellastts, order = c(4, 0, 1))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ma1  intercept
##       0.3678  -0.0002  0.0752  -0.2623  0.2655    -0.9973
## s.e.  0.1698   0.1163  0.0586   0.0483  0.1754    54.5999
## 
## sigma^2 estimated as 495301:  log likelihood = -3166.58,  aic = 6347.17
AIC(modelarima1)
## [1] 6347.166
BIC(modelarima1)
## [1] 6375.053
auto.arima(modellastts)
## Series: modellastts 
## ARIMA(5,0,0) with zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5
##       0.6376  -0.1647  0.1184  -0.3015  0.0875
## s.e.  0.0500   0.0574  0.0576   0.0572  0.0498
## 
## sigma^2 estimated as 500660:  log likelihood=-3166.21
## AIC=6344.42   AICc=6344.63   BIC=6368.32
predicted1 = modellastts - modelarima1$residuals 

tayt_reduced = tayt_reduced[, predictedlinear := tayt_reduced$sold_count - predicted1]

ggplot(tayt_reduced, aes(x=event_date)) + 
  geom_line(aes(y = sold_count, color = "Actual Sales")) + 
  geom_line(aes(y = predictedlinear, color="Prediction")) + scale_x_date(date_breaks = "1 day", date_labels = "%d") + ggtitle("Sales of Leggings") + xlab("day") + ylab("Sales")

The fitted values track the actual sales well.

This setup allows manual price entry for the prediction; as an example, a price of 56 is entered below. Each day, we estimated the expected price with all discounts applied, and the result obtained from this model influenced the final estimation.

model_forecast <- tail(predict(modelarima1, n.ahead = 2)$pred, 1)

x <- cbind(tail(tayt_reduced$basket_count, 1), tail(tayt_reduced$category_sold, 1), 56)
x <- as.data.table(x)
names(x) <- c("basket_count_lag", "category_sold_lag", "price")

p <- predict(model1, x) + model_forecast  # regression prediction plus forecasted ARIMA error
print(p)
## Time Series:
## Start = 399 
## End = 399 
## Frequency = 1 
##        1 
## 537.9064

Thanks to this method, we can interpret the effect of price and, by adjusting the other methods' forecasts with this result, obtain better estimates.
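Since the final submissions combined well-performing methods through weighted averages, the adjustment can be sketched as follows (the function name, weights, and forecast values here are all hypothetical, chosen for illustration):

```r
# Illustrative sketch only: combine forecasts from several methods with
# manually chosen weights, as done informally when preparing submissions.
combine_forecasts <- function(forecasts, weights) {
  stopifnot(length(forecasts) == length(weights))
  sum(forecasts * weights) / sum(weights)  # weighted average
}

# e.g. price-model forecast of 538 and ARIMA forecast of 480,
# weighted 0.4 / 0.6 based on recent performance (hypothetical values)
final_estimate <- combine_forecasts(c(538, 480), c(0.4, 0.6))
```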

4 Conclusion

In this project, we tried to estimate the daily sales of nine different products sold through Trendyol. Various approaches were used in the analysis, such as dynamic regression, decomposition, and SARIMA models. Due to data availability, we had to predict two days ahead, which decreased the accuracy of our forecasts. Also, part of the data was missing, so we were unable to use those variables as external regressors. In addition, we did not have information about future discounts or stockouts; when there was a discount or a stockout of a product, our predictions became less accurate. Having access to this information would have greatly increased our prediction accuracy. Finally, as an improvement, generalized additive models, which are very strong in forecasting seasonal series, could be used to model the sales data. This approach might have generated more accurate predictions.
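As an illustration of that suggestion (a toy sketch with simulated data and the mgcv package, not something fitted in this project), a GAM with a cyclic day-of-week spline could look like:

```r
library(mgcv)  # GAM implementation; a recommended package shipped with most R installs

# Toy weekly-seasonal sales series, not the project data
set.seed(1)
day   <- 1:140
dow   <- day %% 7
sales <- 200 + 30 * sin(2 * pi * dow / 7) + rnorm(140, sd = 10)

# A cyclic cubic spline forces the seasonal effect to match at week boundaries
fit <- gam(sales ~ s(dow, bs = "cc", k = 7), knots = list(dow = c(0, 7)))
predict(fit, newdata = data.frame(dow = 0:6))  # estimated day-of-week profile
```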

5 References

  1. D. Wei, P. Geng, L. Ying and L. Shuaipeng, “A prediction study on e-commerce sales based on structure time series model and web search data,” The 26th Chinese Control and Decision Conference (2014 CCDC), 2014, pp. 5346-5351, doi: 10.1109/CCDC.2014.6852219.
  2. Singh, K., Booma, P. M., & Eaganathan, U. (2020). E-Commerce System for Sale Prediction Using Machine Learning Technique. Journal of Physics: Conference Series, 1712, 012042. https://doi.org/10.1088/1742-6596/1712/1/012042
  3. Maobin Li, Shouwen Ji, Gang Liu, “Forecasting of Chinese E-Commerce Sales: An Empirical Comparison of ARIMA, Nonlinear Autoregressive Neural Network, and a Combined ARIMA-NARNN Model”, Mathematical Problems in Engineering, vol. 2018, Article ID 6924960, 12 pages, 2018. https://doi.org/10.1155/2018/6924960